ABSTRACT



The  current  training  evaluation  and  student  measurement  literature  is  reviewed.  The 
emphasis  is  on  studies  which  have  been  reported  in  the  last  ten  years,  although  earlier 
studies  which  have  impacted  heavily  on  recent  trends  are  also  included.  Because  of  the
obvious  interaction  between  both  training  evaluation  and  student  measurement,  on  the
one  hand,  and  such  topics  as  statistical  methods,  methods  for  course  development, 
training  methods,  learning  styles,  motivation,  and  moderator  variables,  on  the  other  hand, 
these  and  similar  considerations  are  also  included. 


SUMMARY 


Bergman,  B.A.,  &  Siegel,  A.I.  Training  evaluation  and  student  achievement  measurement:  A  review  of  the
literature.  AFHRL-TR-72-3.  Lowry  AFB,  Colo.:  Technical  Training  Division,  Air  Force  Human 
Resources  Laboratory,  January  1972. 

Problem 

The  purpose  of  this  paper  is  to  review  the  training  evaluation  and  student  achievement  measurement 
literature  with  primary  emphasis  being  placed  on  studies  reported  in  the  last  ten  years. 

Approach 

Recent  trends  in  training  evaluation  and  student  achievement  measurement  are  presented.  Because  of
the  obvious  interaction  between  both  training  evaluation  and  student  measurement,  on  the  one  hand,  and 
such  topics  as  statistical  methods,  course  development  methods,  training  techniques,  learning  styles, 
motivation,  and  moderator  variables,  on  the  other  hand,  these  and  similar  considerations  are  also  included. 

Results 

Where  new  methods  of  training  evaluation  and  student  achievement  measurement  appeared  in  the
literature,  detailed  presentations  were  given.  Among  these  procedures  were  cost-effectiveness  or  cost-benefit 
analysis,  criterion-referenced  testing,  sequential  testing,  confidence  testing,  convergent  and  discriminant 
validity,  and  computer  assisted  branched  testing. 

Conclusions 

Systematic  approaches  to  evaluation  and  course  development  are  receiving  more  and  more  attention. 
Most  systems  begin  with  a  job  analysis  in  order  to  derive  a  list  of  behaviorally  oriented  job  requirements 
from  which  training  objectives  can  be  formulated.  The  new  techniques  in  evaluation  and  measurement  have 
resulted  from  attempts  to  determine  whether  training  objectives  have  been  realized. 

This  summary  was  prepared  by  Wayne  S.  Sellman,  Technical  Training  Division,  Air  Force  Human 
Resources  Laboratory. 

TRAINING  EVALUATION  AND  STUDENT  ACHIEVEMENT  MEASUREMENT:
A  REVIEW  OF  THE  LITERATURE 


I.  INTRODUCTION 

Methods  and  procedures  for  evaluating  training 
courses  and  student  achievement  have  been  slowly 
evolving  and  assuming  increased  stature  within  any 
training  program  developmental  paradigm  which 
aims  to  be  at  all  complete.  This  increased  emphasis
on  training  evaluation  and  student  measurement  is 
due,  in  part,  to  the  increased  realization  that  there 
can  be  no  training  system  without  quality  control. 
Training  in  this  sense  is  viewed  as  a  process 
(analogous  to  a  chemical  or  manufacturing 
process)  in  which  raw  material  (students)  is 
converted  from  one  form  to  another  (skilled 
craftsmen).  Within  such  a  construct,  there  must  be 
a  quality  control  stage;  training  evaluation  and 
student  measurement  represent  the  quality  control 
stage  in  the  training  process. 

This  report  selectively  reviews  the  current 
literature  related  to  training  evaluation  and 
student  achievement  measurement.  The  review 
period  extends  over  the  20  years  preceding  1970, 
although  the  emphasis  is  not  evenly  apportioned 
throughout  the  entire  span.  The  first  ten  years  of 
the  period  are  only  briefly  covered.  Advances  of
the  last  decade  indicate  that,  except  for  historical 
perspective,  the  1950  to  1960  time  frame  should 
be  treated  rather  lightly  in  a  review  such  as  this. 
Air  Force  flight  equipment  of  the  Korean  War  and 
immediate  post-Korean  War  era  is  today  looked 
upon  as  vintage  equipment.  Ten  years  ago,  the
digital  computer,  systems  thinking,  and 
programmed  instruction  were  in  their  virtual 
infancy;  and  computer  assisted  training,  T-group 
training,  and  behavior  modification  were  all  things 
of  the  future.  Accordingly,  the  first  decade  of  the 
review  period  has  received  only  modest  emphasis.

The  heavier  emphasis  in  this  review  is  the  recent 
ten  years,  with  the  last  five  being  most  thoroughly
covered.  The  goal  was  to  examine  the  subject
matter  areas  but,  most  importantly,  to  determine 
for  future  reference,  the  answers  to  the  questions 
“what  is  new  in  training  evaluation?”  and  “what  is
new  in  student  achievement  measurement?”  With 
these  principal  goals,  placement  of  heaviest 
emphasis  on  the  most  contemporary  time  period 
seems  clearly  indicated. 


Sources  Searched 

In  order  to  identify  relevant  literature,  the 
following  sources  were  searched:  Psychological
Abstracts,  Technical  Abstract  Bulletins  of  the 
Defense  Documentation  Center,  and  the  U.  S. 
Government  Research  and  Development  Reports, 
published  by  the  Department  of  Commerce. 

The  Psychological  Abstracts  were  reviewed 
from  Number  1  of  the  1966  volume  through 
Number  4  of  the  1971  volume,  thus  affording 
entry  to  the  literature  of  the  1965-1970  period. 
The  topics  covered  were  Education  and  Training  in 
the  General  section;  Testing  in  the  Methodology 
and  Research  Technology  section;  Testing, 
Counseling  and  Guidance,  Teachers  and  Teacher 
Training,  School  Learning  and  Achievement  in  the 
Educational  Psychology  section;  and  Vocational 
Choice  and  Guidance,  Selection,  and  Placement, 
and  Training,  in  the  Personnel  and  Industrial 
Psychology  section. 

The  Technical  Abstract  Bulletins  were  reviewed 
from  Number  1  of  the  1966  index  volume  to 
Number  24  of  the  1970  volume.  The  topics 
searched  in  these  index  volumes  were  Evaluation,
Performance,  Personnel,  and  Testing. 

The  U.  S.  Government  Research  and  Development
Reports  reviewed  were  from  issue  Number  1
of  1968  to  Number  12  of  1971.  The  major  subject
field  searched  was  Behavioral  and  Social  Sciences;
the  specific  subfields  examined  were  Human
Factors  Engineering,  Man-Machine  Relations,
Personnel  Selection,  Training  and  Evaluation,  and
Psychology  (Individual  and  Group  Behavior).

In  addition  to  these  systematic  searches  of 
source  listings,  the  act  of  reading  in  the  literature 
unearthed  other  literature  of  relevance.  Particularly
valuable  in  suggesting  articles  and  books  of
importance  were  issues  of  the  Psychological 
Bulletin  and  appropriate  chapters  of  the  Annual 
Review  of  Psychology.  Thus,  as  a  result  of  the 
systematic  examination  of  three  listing  sources, 
the  utilization  of  other  review  and  discussion 
articles  which  integrated  much  of  the  thinking  in 
the  subject  fields,  and  the  normal  reading  of  the 
published  materials  of  these  fields,  a  degree  of 
confidence  can  be  manifested  in  the
comprehensiveness  of  the  coverage  of  this  review.

Training  Evaluation  and  Student
Achievement  Measurement 

Training  evaluation  and  student  achievement 
measurement  in  some  ways  involve  similar  constructs,
and  in  some  ways  they  involve  different
constructs.  Moreover,  several  different  meanings
have  been  attached  to  the  term  “training
evaluation.”

There  are  at  least  three  major  and  quite 
different  reasons  for  measuring  student  achievement.
The  most  time-honored  of  these  is  for
determining  whether  the  student  has  mastered  the 
prescribed  subject  matter  and,  hence,  can  be 
promoted,  graduated,  certified,  licensed,  or  in 
some  other  way  acknowledged.  This  type  of 
student  measurement  takes  place  for  purposes  of 
evaluating  the  student;  and  it  is  completely 
distinct  from  evaluating  the  training  provided  to 
the  student,  or  from  other  reasons  for  student 
measurement. 

A  second  reason  for  student  measurement  is  to 
determine  his  subject  matter  areas  of  strength  and 
weakness  for  reinforcement  and  feedback  purposes 
and  for  diagnosis  and  subsequent  remedial  action. 
Many  automated,  or  programmed,  instructional 
texts  and  devices  provide  for  this  type  of  measurement,
as  do  most  good  tutors.  This  student
measurement  is  an  instructional  technique,  and  it 
is  completely  distinct  from  evaluating  either  the 
student  or  the  training. 

Finally,  student  measurement  is  employed  for 
purposes  of  drawing  inferences  about  the  effectiveness
of  the  instruction  provided  to  the  student.
Other  things  being  equal,  it  can  be  inferred  that
the  more  the  students  have  achieved,  the  better
the  quality  of  the  instruction.  Student  achievement
in  this  case  is,  indeed,  a  method  of  training
evaluation.  In  only  one,  then,  of  the  three  uses  of
student  measurement  does  student  measurement
overlap  the  topic  of  training  evaluation.  In  the 
other  two  uses,  student  measurement  is  a  distinct 
topic  of  interest  without  any  necessary  reference 
to  training  evaluation. 

The  term  training  evaluation  also  has  multiple 
meanings  and  has  been  applied  in  a  number  of 
different  contexts.  At  a  minimum,  one  should 
distinguish  comparative  or  relative  training  evaluations
from  more  absolute  evaluations  of  training.
The  first  case  involves  the  determination  of  which 
is  best  among  a  number  of  methods  or  programs 
for  presenting  the  training  content.  The  second 
case  involves  determination  of  how  good  the 
training  is. 

In  addition  to  the  obvious  syllogistic  point  that 
a  particular  program  may  be  the  best  and  yet  not
be  very  good,  the  relative  or  absolute  distinction
has  other  implications  for  this  review.  The  time
frame  covered  has  seen  exceedingly  rapid  acceleration
in  the  rate  of  development  of  new
instructional  methods.  From  Pressey  and  Skinner’s
early  teaching  machines,  to  a  number  of  different 
approaches  to  programmed  texts,  to  computer 
assisted  instruction,  the  “traditional”  classroom 
has  probably  undergone  more  of  a  metamorphosis 
in  this  relatively  brief  time  period  than  in  all  of  its 
preceding  years.  And,  with  each  new  development, 
a  multitude  of  evaluations  comparing  it  either  to 
traditional  methods  or  to  the  last  new  development
have  appeared.  The  result  has  been  a
literature  very  full  of  comparative  training
evaluations.  No  attempt  has  been  made  to  discuss
more  than  a  sample  of  these  comparative  evaluations.
To  do  more  would  overbalance  the  review
with,  in  many  cases,  rather  trivial  studies. 

The  major  thrust  of  this  review  is  on  systems, 
quantitative  methods,  and  evaluations  of  training 
which  have  utilized  more  absolute  criteria.  Such 
studies  have  maximum  import  for  the  quality 
control  stage  within  an  instructional  system.  This 
quality  control  stage  in  an  Air  Force  context  is 
concerned  with  how  well  students  are  prepared  for
job  performance,  not  whether  the  Air  Force’s
method  is  better  or  worse  than  someone  else’s.


II.  DIMENSIONS  OF  EVALUATION 

Roles,  Uses,  and  Characteristics 
of  Evaluation 

Stake  (1969)  and  his  associates  (Stake  & 
Denny,  1969)  differentiate  between  evaluation 
and  scientific  research,  while  admitting  that  both 
can  overlap.  Stake  indicates  that  evaluation  studies 
are  concerned  with  worth  or  value  while  research 
studies  are  rarely  concerned  with  these  issues.
Stake  also  defines  what  is  meant  by  “high”  and
“low”  forms  of  evaluation.  In  high  forms  of
evaluation,  the  results  are  generalizable  across
schools,  situations,  and  students.  In  the  low  form
of  evaluation,  the  findings  are  restricted  to  the
specific  research  situation  because  the  experimental
conditions  are  not  samples  of  the  universe
of  conditions.  This  delineation  of  the  high  and  low 
forms  of  evaluation  is  analogous  to  the  random 
and  fixed-effects  models  referred  to  in  statistical 
(analysis  of  variance)  contexts.  Nonetheless,  many 
persons  engaged  in  student  measurement  and 
training  evaluation  research  have  used  fixed-effects 
designs  and  then  erroneously  generalized  to  other 
programs  of  instruction. 

Flanagan  (1969)  and  Bloom  (1969)  define  what
is  meant  by  the  terms  “formative”  and
“summative”  evaluation.  Formative  evaluation  is  a
process  concerned  with  the  development  of  an
educational  program.  Summative  evaluation,
though,  is  primarily  concerned  with  evaluation  at 
the  end  of  a  program.  Stake  (1969)  feels  that  this
distinction  between  summative  and  formative
evaluation  is  trivial  since  formative  evaluation
never  ends  for  the  instructors  and  program
developers.  A  program  is  summative  only  for
someone  who  is  outside  the  program  and  looking 
in  for  a  statement  of  its  effects. 

Thelen  (1969)  feels  that  the  role  of  evaluative
measurement  is  “. . .  feedback,  diagnosis,  and 
steering...”  of  the  student.  Merwin  (1969), 
taking  a  broader  view,  thinks  that  there  are  three 
roles  for  evaluation:  (a)  school  planning  and 
administration  which  includes  pupil  classification, 
diagnosis  of  learning  disabilities,  appraisal  of  pupil 
progress,  identification  of  special  aptitudes,  pupil 
promotion,  and  effectiveness  of  teaching;  (b)  instruction,
its  diagnosis  and  effectiveness;  and  (c)
student  decision  making  or  helping  the  students  to
plan  and  evaluate  their  own  educational  experiences.
Similarly,  Cronbach  (1963)  lists  course
improvement,  decisions  about  individuals,  and 
administrative  regulation  as  the  purposes  of 
evaluation. 

Wittrock  (1970)  defines  evaluation  as  making
decisions  and  judgments  about  instruction  as
causes  of  learning.  It  is  noted  that  such  judgments
of  causal  relations  are  difficult,  inasmuch  as  differential
psychology  has  studied  individual
differences  to  the  exclusion  of  cause  and  effect 
relations  among  learners,  educational  environments,
and  learning.  The  evaluation  of  instruction,
according  to  Wittrock,  should  include  observation
of  the  student’s  environment  (e.g.,  teacher
characteristics,  student  background),  evaluation  of
the  learners  via  achievement  testing,  and  evaluation
of  learning  or  of  permanent  behavior  changes.
Denova  (1968),  using  a  similar  paradigm,  says  that
evaluation  has  three  components:  assessing
changes  in  employee  (student)  behavior;  observing
whether  training  helps  achieve  organizational
goals;  and  evaluating  the  training  programs,  techniques,
and  personnel.

G.  Johnson  (1970)  lists  three  characteristics  of 
evaluation:  establishing  merit,  applications,  and 
multidimensionality.  Johnson’s  dimensions  of
evaluation  are  objectives,  processes,  components,
end-products,  environmental  context,  secondary 
or  unplanned  effects,  and  costs. 


Angell,  Shearer,  and  Berliner  (1964)  list  four 
uses  for  evaluation  data:  (a)  early  detection  and 
correction  of  behavior;  (b)  continual  modification
of  instructional  procedures  when  appropriate;  (c)
knowledge  of  whether  desired  achievement  levels
have  been  attained;  and  (d)  acquisition  of  learning
curves. 

According  to  Gagne  (1970),  evaluation  has  two 
meanings.  The  first  meaning  of  evaluation  involves 
the  determination  of  the  worth  of  a  system  or 
program,  and  the  second  meaning  involves  determining
if  learning  has  occurred.  These  uses  appear
to  be  directly  analogous  to  the  topic  of  this  literature
review.  Provus  (1969),  emphasizing  training
functions,  thinks  that  the  purpose  of  evaluation  is 
to  determine  whether  to  improve,  keep,  or  end  a 
program.  Evaluation  is  agreement  with  program 
standards,  determining  if  a  discrepancy  exists  in 
some  aspect  of  the  program,  and  using  this  information
to  delineate  the  weak  points  of  the  system.

Wiley  (1970)  compares  and  contrasts  the  concept
of  evaluation  with  the  concepts  of  appraisal
and  assessment.  According  to  Wiley,  assessment 
and  appraisal  involve  the  process  of  “. . .  judging 
what  is  valuable  and  ascertaining  the  particular 
levels  of  valued  traits  (p.  260).”  Evaluation,
though,  is  concerned  only  with  the  latter,  and  it 
must  be  empirical  and  behavioral.  Appraisal,  therefore,
involves  a  designative  and  an  evaluative
function.  Continuing,  Wiley  says  that  “. . .  evaluation
consists  of  the  collection  and  use  of  information
concerning  changes  in  pupil  behavior  in
order  to  make  decisions  about  an  educational 
program  (p.  261).” 

Jaeger  (1970)  feels  that  evaluative  techniques 
can  be  applied  to  institutional  decision  making  and 
educational  management.  Evaluation  can  be 
helpful  in  allocation  of  resources  in  terms  of 
educational  need,  in  modification  of  school  programs,
and  in  promotion  of  public  understanding
of  the  meaning  of  test  scores. 

Crawford  (1969)  and  Berdie  (1969)  have
rather  contrasting  views  of  evaluation  usage.
Crawford  feels  that  the  goals  of  evaluation  are
increased  efficiency,  decreased  time,  and  decreased
costs.  Berdie,  though,  feels  that  the  uses  of  evaluation
are  educational,  vocational,  and  individual.

Perhaps  the  best  statement  of  the  use  of  evaluation
is  given  by  Hemphill  (1969).  He  says  that  the
worth  of  an  evaluation  study  is  based  “. . .  on  its
contribution  to  a  rational  decision  process  in
which  it  is  necessary  to  estimate  the  probability  of
a  desirable  but  uncertain  outcome  of  an  action
chosen  from  a  number  of  alternative  actions  (p.
219).”  In  this  sense,  evaluation  is  an  aid  to  the 
decision  making  processes. 

Thus,  educational  evaluation  has  meant  a 
number  of  different  things  to  different  people. 
The  literature  indicates  it  to  be  multidimensional 
in  purposes,  and  these  purposes  seem  to  vary 
across  the  goals  of  the  evaluators.  Few  have 
separated  measurement  (the  act  of  deriving  data)
from  evaluation  (the  judgments  made  on  the  basis
of  the  data).  Such  a  taxonomy  might  represent  at
least  an  initial  step  toward  providing  a  unifying 
conceptual  scheme.  In  this  sense,  educational 
evaluation  is  a  process  which  is  used  to  make 
decisions  with  regard  to  instructional  programs, 
instructors,  students,  institutional  planning, 
administration,  and  costs.  Measurement  represents 
a  set  of  techniques  which  are  applied  to  derive  the 
data  on  which  the  evaluation  is  based. 

Specification  of  Objectives 

Many  writers  (e.g.,  Bloom,  1969;  Flanagan,
1969;  Glaser,  1967,  1970;  Glaser  &  Glanzer,  1958;
Lavinsky,  1969;  Peck  &  Dingham,  1968;  Waina,
1969;  Whitmore,  1970a,  1970b,  1970c,  1970d)
have  stressed  the  need  for  a  carefully  specified  set
of  objectives  as  a  precursor  to  training  and  evaluation.
While  this  seems  self-evident,  early  specification
of  objectives  often  seems  to  be  ignored.  Most
of  the  sources  indicate  that  objectives  should  be 
defined  in  terms  of  skills  and  behaviors.  An 
essential  step,  then,  prior  to  the  specification  of 
objectives  is  a  behavioral  job  analysis  from  which 
the  basic  job  requirements  can  be  derived.  This
process  should  result  in  a  training  program 
composed  of  small,  discrete  units  with  each  unit 
having  its  own  objective.  Wittrock  (1970)  and 
Cronbach  (1963)  add  that  the  specification  of 
behavioral  objectives  allows  absolute  rather  than
relative  student  measurement.  This  enables  one  to 
determine  who  has  and  who  has  not  achieved  the 
objectives  rather  than  who  scores  best  or  worst. 
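
The  difference  between  absolute  and  relative  student  measurement  can  be  illustrated  with  a  short  sketch.  The  student  labels,  the  scores,  and  the  80  percent  mastery  cutoff  are  hypothetical  values  chosen  for  illustration;  none  of  them  comes  from  the  sources  cited  above.

```python
# Hypothetical scores (percent of items correct) on a test built from
# behavioral objectives. Labels and values are illustrative only.
scores = {"A": 92, "B": 85, "C": 78, "D": 60}

# Assumed criterion; in practice the cutoff would follow from the job analysis.
MASTERY_CUTOFF = 80


def absolute_measurement(scores, cutoff):
    """For each student, report whether the objective was achieved."""
    return {name: score >= cutoff for name, score in scores.items()}


def relative_measurement(scores):
    """Rank each student against the others (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


mastery = absolute_measurement(scores, MASTERY_CUTOFF)
ranks = relative_measurement(scores)
```

In  this  sketch  student  C  ranks  third  of  four  but  has  not  reached  the  criterion,  a  fact  that  the  ranking  alone  would  not  reveal.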

Bloom  (1969)  suggests  that  there  should  be 
consideration  of  the  intangible  outcomes  of 
instruction.  The  intangible  outcomes  may  be 
desirable  (e.g.,  stimulation  of  extra  reading)  or 
undesirable  (e.g.,  dislike  of  subject  matter),  which 
can  lead  to  a  revision  or  change  in  the  educational 
objectives.  These  outcomes,  however,  seem  quite 
amorphous  and  subject  to  considerable  measurement
error.

At  a  still  higher  level  of  abstraction,  Carpenter 
and  Rapp  (1969)  would  determine  the  objectives 
of  training  by  removing  any  objective  which  is
dependent  upon  another  (a  concept  which  is  theoretically
neat  but  impractical);  eliminating  any
objective  that  will  not  be  affected  by  the  choice  of
alternatives  (a  rather  nonempirically  defined  concept);
and  finding  an  abstract  objective  to  which
all  of  the  alternative  objectives  are  means  (which
leaves  the  weighting  of  the  alternative  objectives
open).

Thus,  the  determination  and  specification  of 
objectives  can  assume  a  number  of  levels.  These 
range  from  “objectively”  derived  statements  of 
required  skills  and  knowledges  through  motivational
constructs  and  finally  through  complete
abstraction. 

Systematic  Approaches  to  Course 
Development 

Approaches  to  course  development  have  also 
ranged  from  broad  based  molar  systems  through 
more  discrete  and  molecular  methods. 

Carss  (1969)  advocates  the  use  of  a  flow  chart 
model  of  the  educational  system  components  in 
order  to  derive  a  course.  This  model  should 
contain  the  flow  of  behaviors  or  acts  needed  to 
complete  training.  In  the  operation  of  the  educational
system,  the  relevant  variables  are  identified
and  quantified  and  converted  into  formulae  to
determine  the  effect  on  output  (e.g.,  student
behavior)  when  different  inputs  are  considered.
This  is  a  simulation  technique  because  one  does
not  need  to  intervene  in  the  school.  In  addition,
Carpenter  and  Rapp  (1969)  add  the  obvious  point
that  when  different  systems  are  being  compared,
all  of  their  aspects  which  could  affect  output
should  be  the  same  except  for  those  being  studied.

In  an  earlier  paper,  Glaser  and  Glanzer  (1958)
listed  four  requirements  for  course  development: 

1.  Specification  of  objectives:  A  list  of  the
objectives  of  the  course  in  behavioral  terms.

2.  Input  control:  The  selection  of  enrollees
into  the  training  program  (e.g.,  number  of
men  available,  testing  costs,  etc.).

3.  Techniques  and  methods  of  training:
Decisions  regarding  the  amount  of  practice,
learning  guidance,  reinforcement,  extinction,
training  sequence,  meaningful  relationships
in  learning,  use  of  punishment,  learning
plateaus,  motivation,  individual  differences,
etc.

4.  Output  control:  Measurement  of  training
(e.g.,  formative  evaluation,  setting  of  proficiency
standards,  diagnosis  of  training  inadequacies,
performance  tests,  etc.).

Osborn  (1970)  presents  an  interesting  model
which  he  calls  a  “closed  loop”  approach.  However,
as  early  as  1950,  workers  in  the  area  regarded
training  evaluation  as  feeding  back  to  the  instructional
process.  Thus,  the  closed  loop  concept
would  not  be  regarded  as  a  “new”  development.
Osborn  indicates  that  job  requirements  lead  to
training  objectives  which  result  in  training  content
and  performance  tests  which  ultimately  yield  an
evaluation  of  the  quality  of  student  performance
in  terms  of  job  requirements.  Osborn  feels  that  it
is  often  too  costly  to  develop  a  full  field  performance
test  for  a  large  number  of  individuals.  He
suggests  a  matrix  approach  as  the  solution  to  this 
dilemma.  First,  the  job  components  (behaviors) 
are  listed  across  the  top  of  the  page.  Down  the  left 
side  of  the  page  is  a  list  of  the  potential  test 
methods  graded  in  degree  of  complexity  from  full 
field  to  paper-and-pencil  (e.g.,  simulations,  photos, 
pictures,  drawings).  Osborn  contends  that  many 
times  it  is  necessary  to  compromise,  to  sacrifice
relevance  and  diagnostic  capability  for  economy. 
The  alternatives  must  be  considered,  and  then  the 
most  complex,  yet  feasible  method,  must  be 
selected  and  used. 
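
The  selection  rule  in  the  matrix  approach  can  be  sketched  as  follows.  The  ordering  of  methods  from  full  field  to  paper-and-pencil  follows  the  description  above;  the  job  components,  the  intermediate  methods  chosen,  and  every  feasibility  judgment  are  hypothetical  and  are  not  taken  from  Osborn  (1970).

```python
# Test methods ordered from most complex (full field) to least complex
# (paper-and-pencil). The intermediate entries are illustrative assumptions.
METHODS = ["full field", "simulation", "photos", "drawings", "paper-and-pencil"]

# One column per job component (behavior). Each column records which
# methods are judged feasible, e.g., affordable for the number of examinees.
# All components and judgments here are hypothetical.
FEASIBLE = {
    "adjust radio": {"full field", "simulation", "paper-and-pencil"},
    "read terrain map": {"photos", "drawings", "paper-and-pencil"},
    "report by radio": {"simulation", "photos", "paper-and-pencil"},
}


def select_method(feasible_methods, methods=METHODS):
    """Return the most complex method that is still feasible, or None."""
    for method in methods:  # scanned from most to least complex
        if method in feasible_methods:
            return method
    return None


selection = {component: select_method(ok) for component, ok in FEASIBLE.items()}
```

Each  component  receives  the  most  complex  method  its  column  allows,  so  relevance  is  sacrificed  only  where  economy  requires  it.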

The  sequence  of  course  development  used  in 
the  Army’s  Trainfire  I  program  (Crawford,  1969)
includes  (a)  job  analysis;  (b)  transfer  of  the  job
description  into  a  test  of  how  well  the  man
performs  the  necessary  skills;  (c)  development  of
new  training  stressing  realism,  clarity,  and
simplicity;  and  (d)  experimentation  using  a  conventionally
trained  group  and  an  experimentally
trained  group  which  are  compared  on  the  test.

Glaser  (Glaser,  1970a,  1970b;  Glaser  &  Cox, 
1968)  presents  a  somewhat  more  elaborate  model 
than  his  earlier  version  (Glaser  &  Glanzer,  1958).
This  new  model  includes  the  following: 

1.  Specification  of  objectives  in  terms  of 
observable  behavior.  Criterion-referenced 
measures  indicate  the  content  of  the 
subject’s  behavior  in  regard  to  the  objectives 
and  without  regard  to  the  performance  of 
others. 

2.  Diagnosis  and  profiling  of  the  subject  entering
instruction.  The  types  of  entering
behavior  that  need  measurement  are
previous  extent  of  achievement  in  the 
subject  area,  prerequisites,  learning  set 
variables,  ability  to  make  discriminations, 
and  general  intelligence. 

3.  Selection  of  “instructional  alternatives”
based  on  the  diagnostic  and  profiling  step  of
the  system. 


4.  Continuous  assessment  and  monitoring 
which  can  include  frequency  of  correct 
answers,  errors  in  relation  to  a  standard, 
speed,  transfer  and  generalization,  attention 
span,  and  response  latency. 

5.  Adaption  and  optimization.  The  treatments 
and  individual  differences  may  interact; 
therefore,  individuals  should  be  adapted  to 
the  best  treatment.  Those  that  interact  most 
with  the  treatment  are  the  most  important.
Decisions  about  treatments  should  be  made
sequentially,  and  these  should  be  optimized
by  using  quantitative  methods.

6.  Evolution  or  self-contained  improvement
capability  that  modifies  itself  after 
acquisition  of  new  knowledge. 

A  system  which  mirrors  much  of  the  prior 
thinking  is  the  Instructional  System  Development 
(ISD)  technique  developed  by  the  United  States
Air  Force  (Air  Force  Manual  50-2,  1970).  This 
system  in  its  latest  form  contains  the  following 
steps: 

1.  Analyze  system  requirements

2.  Define  education  or  training  requirements 

3.  Develop  objectives  and  tests 

4.  Plan,  develop,  and  validate  instruction 

5.  Conduct  and  evaluate  instruction 

Hunter,  Lyons,  MacCaslin,  Smith,  and  Wagner 
(1969)  feel  that  training  program  content  must  be 
job  relevant.  Taking  the  seven-step  Human 
Resources  Research  Organization  method  of 
curriculum  development  and  applying  it  to  what 
the  services  are  doing,  they  reported  several 
findings:  (a)  System  analysis  for  training  purposes
was  not  used  in  any  of  the  services;  (b)  there  was  a
requirement  for  task  inventories  in  the  Army  and
Air  Force;  (c)  there  was  no  development  of  a  job
model  for  any  service;  (d)  there  was  no  task
analysis  for  curriculum  development;  (e)  all  services
said  training  objectives  should  be  job  relevant
but  no  provision  was  made  for  specificity;  (f)
training  program  development  procedures  were
not  maximally  effective  because  the  objectives
were  not  fully  specified;  (g)  there  was  very  little  or  no
evaluation  and  assessment  of  training  effects  (the
Air  Force  had  the  only  standards  of  graduate
behavior  and  was  the  only  service  to  perform  field
visits);  and  (h)  training  accounted  for  6  percent  of
the  defense  budget.

In  summation,  the  systematic  approaches  to 
course  development  attempt  to  account  for  almost
all  of  the  variables  that  can  affect  training  and
student  behavior.  Most  of  the  systems  begin  with 
job  analysis  in  order  to  derive  a  set  of  behavioral 
job  requirements  from  which  training  objectives 
can  be  formulated.  Many  writers  advocate  a  pretraining
assessment  of  the  entering  students  in
order  to  channel  them  to  the  training  program 
which  is  most  suited  to  their  needs  and  abilities. 
Performance  tests  and  other  measures  of  student 
behavior  are  then  constructed  in  order  to  reflect 
the  training  objectives.  Finally,  after  training  the 
students,  the  training  programs  are  evaluated 
through  various  means. 

Measures  and  Methods  of  Evaluation 

Campbell  (1971)  presents  a  rather  dim  picture 
of  the  current  state  of  methodology  in  training 
and  evaluation  literature.  He  feels  “. . .  by  and
large,  the  training  and  development  literature  is
voluminous,  nonempirical,  nontheoretical,  poorly
written,  and  dull  (p.  565).”  Continuing,  Campbell
says  that  “. . .  In  sum,  the  methodology  of  training
and  development  research  cries  for  innovation.
. . .  As  yet  we  have  no  workable
technology  that  is  capable  of  producing  a  large
amount  of  training  research  data  (p.  579).”

Similarly,  Schultz  and  Siegel  (1961a,  1961b)  as 
the  result  of  a  comprehensive  review,  observed 
earlier  a  need  for  a  unifying  conceptual  structure 
with  more  emphasis  on  theoretical  development  in 
the  area  of  job  performance  rather  than  technical 
advancements.  They  argued  for  more  research 
based  on  an  integrative  theoretical  framework 
rather  than  on  an  inductive  framework. 

Campbell,  Dunnette,  Lawler,  and  Weick  (1970)
divide  training  criteria  into  two  groups.  Internal
criteria  are  those  directly  concerned  with  the  training
itself,  while  external  criteria  measure  posttraining
or  on-the-job  behavior.  These  authors
recommend  the  use  of  multiple  criteria,  each
reflecting  different  aspects  of  the  organization's
goals.  Gagne  (1970)  presents  a  similar  dichotomy
in  which  he  stresses  initial  problems  directly
connected  with  the  lesson  and  transfer  problems 
involving  principles  taught  in  the  lesson. 

Use of a composite overall criterion will undoubtedly
obfuscate important relationships since
many of the subcriteria within the composite are
probably orthogonal (Cronbach, 1963). According
to Dunnette (1963), it is preferable to have
multiple criteria in order to account for a greater
proportion of the behavior variance.

The evaluation or measurement must not be
affected by the method of measurement or
research procedure. Even the presence of the
experimenter or the process of evaluation itself
can alter the results (Bloom, 1969; Cronbach,
1963). According to Gagne (1970), two evaluation
criteria for measures are “distinctiveness” and
“freedom from distortion.”

Weiss and Rein (1970) claim that broad based
evaluation programs have design and technical
problems so ponderous as to make any evaluation
impractical and questionable. They propose a
developmentally oriented, more qualitative evaluation
as being more appropriate. Weiss and Rein
imply that where there are many variables to
consider, one cannot possibly prove or disprove
the values of any program.

Biel (1962) says that “. . . fundamental
criteria for evaluating a simulation-based training
program or device is the extent of transfer of training
to the live situation. . . . In cases . . . where
ultimate criteria are obviously unavailable, intermediate
criteria must be employed. One example
of an intermediate criterion is performance in a final
examination. . . . Sometimes improvement as
measured by performance on the training device
itself is the best measure available of the effectiveness
of the device and its associated training
program (pp. 377-378).” Gagne (1968) has given a
similar emphasis to transfer of training.

Crawford  (1962)  and  Glaser  and  Klaus  (1962) 
posit  that  proficiency  tests  developed  from  job 
analysis  should  be  employed  to  evaluate  students 
and  training.  The  standards  on  the  proficiency  test 
must be based on acceptable or adequate job
behavior. 

Cronbach (1963) feels that, in training evaluation
and student measurement, the testing of
terminology  which  is  specific  to  the  training 
course  should  be  kept  independent  from  tests  of 
understanding  of  content.  A  person  who  is  not 
taking  the  course  should  be  able  to  understand 
(not  necessarily  answer)  the  question.  Cronbach 
also  classifies  transfer  of  learning  into  an 
immediate  and  a  long-term  category.  Immediate 
transfer  involves  testing  the  student’s  course 
knowledge,  while  long-term  transfer  is  concerned 
with  aptitude  gain  and  learning  to  learn. 

Angell, Shearer, and Berliner (1964) list three
types of training measures:

1. Initial measures given prior to instruction or
training and which are used for selection
purposes. The correlation between the
selection tests and future performance
should be high.

2. Interim measures taken while training is in
progress, and “. . . they are more accurately
predictive of terminal proficiency than are
measures made earlier (p. 3).”

3. Terminal measures obtained after training is
completed and which are predicted by the
initial and interim measures. Some examples
of terminal measures are written tests, oral
tests, performance tests, expert judgments,
and rating scales.

Peck and Dingman (1968) present a unique
method  of  evaluating  student  teachers.  Training  is 
attained  when  each  of  the  training  objectives  is 
reached  by  the  student  teacher,  and  these  advances 
yield  significant  pupil  gains  in  the  classroom. 

Della-Piana and Berger (1970) have provided a
design for conducting pilot studies on the
efficiency of programmed instruction. They begin
with six to eight subjects of above average ability
who can give verbal feedback which is relevant to
program revision. The subjects are split into groups
of three or four each. The groups are presented
with the programmed instruction, and, on
completion of the training, they are queried
regarding possible revisions for the program.

Thelen (1969) describes diagnosis (progression
toward goals) and troubleshooting (difference
between what exists and what ought to be) in the
context of group instruction. In group instruction,
the students are unsupervised most of the class
time, and the instructor can only hope to sample
their behavior. In a highly structured class, the
evaluation is in an authoritarian framework in
which student and teacher behavior are evaluated
on several continua from good to poor. This can
be considered evaluation of deviancy. In the unstructured
class, no set of criteria for describing
deviant behavior can exist. All behavior is thought
to be relevant, and attempts are made to account
for it, or to understand why it occurred. The
authoritarian teacher knows what is to be taught
and determines the extent to which individuals
differ in meeting expectations. The more democratic
instructor will use games, ungraded classes,
small work groups, and student cohesiveness.
Finally, Thelen advocates the use of “barometric”
individuals, or students who respond consistently
and selectively to instruction or to some other
important group condition.

Wiley (1970) advocates a system of evaluation
which could lead to a great saving in time. First, if
all the students in the class receive the same
experimental treatment, then the appropriate
statistical datum is the class, not the student.
When the datum is a collective, one can sample
from it and save considerable time. In addition,
one does not have to give each student all the
items. Even single items can be used, and they are
easier to interpret than total scores. Jaeger (1970)
uses the aforementioned sampling strategy for
institutional decision making.
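
The sampling idea can be illustrated with a short sketch. It is
not Wiley's or Jaeger's procedure; it only assumes that item
responses are scored 0 or 1 and that the class mean proportion
correct is the quantity of interest. The function name
class_mean_from_sample and its parameters are hypothetical.

```python
import random


def class_mean_from_sample(responses, n_students, n_items, seed=0):
    """Estimate a class's mean proportion correct from a sample.

    responses[s][i] is 1 if student s answered item i correctly, else 0.
    Only n_students students are tested, and each sampled student sees
    only n_items randomly chosen items (an assumed sampling scheme).
    """
    rng = random.Random(seed)
    total_items = len(responses[0])
    sampled_students = rng.sample(range(len(responses)), n_students)
    scored = []
    for s in sampled_students:
        # Each sampled student receives a different random subset of items.
        for i in rng.sample(range(total_items), n_items):
            scored.append(responses[s][i])
    return sum(scored) / len(scored)
```

When every student and every item is sampled, the estimate
equals the full class mean; smaller samples trade precision for
testing time, which is the saving described above.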

Wiley also introduces some new terminology in
his descriptive system of evaluation. First, the
standards of evaluation involve designating traits
to evaluate and designating the levels that are
thought to be appropriate. Secondly, the object of
evaluation is the instructional program and its
component parts. Next, the vehicles of evaluation
are directly affected by the objects, and they
consist of students, classes, or schools. Finally, the
instruments of evaluation display the behavior of
the vehicles. Wiley says that the fundamental
problem in evaluation “. . . is to establish the
effects of the objects on the vehicles by means of
the instruments (p. 262).”

Furno (1966) has an evaluation approach
confined to educational surveys. The sequential
elements in Furno’s system are (a) specification of
survey objectives; (b) definition of the population;
(c) description of what information is to be
collected; (d) determination of the best mode of
measurement; (e) selection of the sampling unit;
(f) selection of the sample; (g) planning of field
work so that it will be carried out smoothly; (h)
conduct of a pilot study; (i) provision for data
processing; (j) analysis of data; and (k) storing of
survey information and providing for access when
needed.

Somewhat less elaborate are Hawkridge’s
(1970) seven phases of evaluation research: (a)
specification of objectives; (b) selection of
objectives to be measured; (c) selection of instruments
and methods; (d) sample selection; (e)
measurement and observation schedule development;
(f) choosing analytic techniques; and (g)
drawing conclusions and making recommendations.

Campbell  (1970)  suggests  a  completely  selective 
approach  including  the  use  of  an  evaluation  model 
which  measures  trainee  reactions,  trainee  learning, 
trainee  behavior  on  the  job,  and  results  with  regard 
to  the  organization.  Campbell  concludes  that  too 
many  evaluation  studies  have  focused  on  the 
measurement of trainee reaction (e.g., attitudes
and  opinions),  to  the  exclusion  of  the  other 
dependent  measures. 

Flanagan’s (1969) system of evaluation includes
(a) defining the outputs of the system including
the objectives and unplanned effects; (b) selecting
the procedures needed to measure the worth of
the outcomes (e.g., costs, benefits); and (c)
composing a plan based on analysis including a
decision and overall evaluation of the final program.

Possibly,  an  evaluation  which  aims  to  be  at  all 
complete  should  include  consideration  of  most,  if 
not  all,  of  Scriven’s  (1967)  criteria.  They  include 
(a) knowledge of specific items of information and
patterns and sequences of information items; (b)
comprehension of internal relationships within the
field (e.g., inferences and implications), interfield
relationships or the association between the
knowledge of one field and that of another, and
application of the field or its principles to an
appropriate example; and (c) motivation and
attitude  toward  the  course,  the  subject,  the  field, 
field  relevant  materials,  learning  and  knowledge 
activities  in  general,  school,  career  teaching,  the 
teacher,  peers,  and  self. 

Problems  of  Evaluation 

As  was  mentioned  previously,  Campbell  (1970) 
thinks  that  too  many  evaluation  studies  use 
measurement  of  trainee  reactions  to  the  exclusion 
of  trainee  learning,  trainee  behavior  on  the  job, 
and  effects  on  the  organization.  Trow  (1970)  feels 
that  much  innovation  in  training  is  done  for  its 
own  sake  to  relieve  boredom  and  only  secondarily 
for its outcomes. Evaluation studies are too often
large-scale  and  aimed  at  funding  agencies  to  prove 
that  the  innovation  is  of  value. 

C. Harris (1970) points out that most investigators
fail to integrate prior research into their
experimental designs. He goes one step further by
posing the question of integrating prior research
findings into numerical research analysis. Harris’
concept  would  be  feasible  if  more  collaboration 
could  be  achieved  among  different  agencies  and 
investigators.  A  related  problem  (Lortie,  1970)  is 
whether  or  not  ultimately  too  much  centralized 
evaluation  will  be  achieved  (without  realizing  it) 
through  the  use  of  computers  and  data  processing 
equipment.  Clearly,  an  optimum  middle  ground 
must  be  found. 

Student  measurement  can  have  both  positive 
and  negative  effects.  The  person  being  evaluated 
will  always  respond  to  evaluation  in  terms  of  the 
perceived fairness. If he perceives the evaluation as
unfair,  the  person  being  evaluated  may  become 
resentful,  especially  if  the  evaluation  is  more 
critical  to  his  career  or  to  his  student  status 
(Bloom,  1970). 


Evaluation cannot function in an authoritarian
society which resists social change. Evaluation also
does not function well in an equalitarian society
because all persons in it are considered equal. In
actuality, evaluation functions best in a competitive
society (Berdie, 1969). One must also
consider the various publics at which the evaluation
is aimed. These publics are trainees, trainers,
sponsoring organizations, training technicians, and
social scientists. The value of a particular type of
training must be presented to the public with
which it is concerned, and it may be different for
each public (Bass, Thiagarajan, & Ryterband, 1968).

Walker (1965) performed a study illustrating
one of the most serious problems in evaluation
research. He asked 20 training experts to rate 16
training techniques with regard to 34 training
selection criteria. These training personnel tended
to  select  training  methods  based  on  administrative 
and  contractual  needs  to  the  exclusion  of  training 
methods  based  on  educational  and  psychological 
principles.  Walker  concluded  that  this  group  of 
training  experts  was  more  concerned  with  budget 
and  training  time  than  with  learning. 

Berdie  (1969)  lists  conceptual  needs  and 
problems  of  evaluation  and  measurement.  He 
identifies  the  requirement  to  evaluate  whole 
persons  and  the  various  ways  in  which  traits 
cluster together; and, further, the need to know
more about statistical as opposed to clinical
prediction. Breadth of evaluation in addition to
depth of evaluation must be considered; and
various statistical modes of prediction must be
attempted (e.g., moderator variables).

Smode, Hall, and Meyer (1966) severely
criticize Air Force evaluation research. They
contend that (a) different dependent measures are
often used across studies leading to incomparability
of results; (b) too much stress is placed upon
subjective opinions (e.g., rating); (c) different
limits or standards are used for describing performance;
(d) too many personnel and equipment
changes occur during the execution of many
studies resulting in a lack of proper research
control; (e) different methods of processing and
interpreting the transfer of training data are
employed; (f) presentation of the same study in
different reports makes it difficult to determine
exactly what was done; (g) inadequate and
imprecise criteria are used; (h) comparability and
control of skill levels of subjects and trainees are
lacking; (i) there is difficulty in matching research
criteria and tasks to flight conditions and
demands; and (j) there is disorganization and lack
of cooperation among researchers.

In a somewhat different context, Suchman
(1967) presents a systematic overview of the shortcomings
of evaluation research in general. First,
with  regard  to  objectives,  Suchman  feels  that 
certain  excesses  have  tended  to  characterize  the 
research:  too  much  arbitrary  problem  selection; 
too  much  stress  on  resources  and  material  and  not 
enough  on  achievement;  too  much  stress  on 
quantity  of  services  and  record  keeping  at  the 
expense  of  true  evaluation;  too  much  emphasis  on 
program  objectives  based  upon  tradition  and 
common sense; too much mixing of final, intermediate,
and immediate objectives; and too much
idealism  and  not  enough  realism. 

In  listing  inadequacies  regarding  procedural 
methods,  Suchman  criticizes  the  excessive 
emphasis  on  research  based  on  available  or  existing 
records  which  discourages  the  gathering  of  new 
data;  the  absence  of  sound  experimental  designs, 
thus  making  it  difficult  to  determine  if  change  is 
the  result  of  innovation  or  chance;  the  use  of 
measurements  of  unknown  consistency  and 
accuracy;  the  use  of  weighting  methods  and 
standards  too  often  based  upon  rational  rather 
than  empirical  means;  the  inadequate  allowance 
for  or  control  of  demographic  variables  (e.g., 
locale,  race,  age)  making  interpretation  difficult; 
and the over-emphasis on correlation with inadequate
attention to causality.

Suchman also comments on the administration
of evaluation studies, contending that evaluation
guides are too often used by unsophisticated
persons, thus making analysis and comparison of
ratings difficult. Further, he suggests that self-evaluations
are too often used, which allows bias
to contaminate data. And, finally, when supervisors
are forced to perform evaluations in
addition to their usual activities, it becomes
difficult to properly plan, organize, and conduct
evaluation studies.

What  generalization  can  be  extracted  from  this 
mass  of  critical  rhetoric?  First,  these  writers  seem 
to  think  that  there  has  been  too  much  use  of 
rational  (armchair)  rather  than  empirical  methods. 
Similarly,  they  feel  that  evaluation  research  is  too 
often  subjective  when  objectivity  is  needed. 
Finally,  evaluation  research  is  too  often  limited  by 
monetary  considerations.  The  monetary  criticism 
is  probably  the  most  important,  since  most  of  the 
other  criticisms  can  be  reduced  to  it.  What  most 
investigators  do  not  realize  is  that  cost  cutting 
actually  wastes  money  because  the  results  of  the 
research are at best uninterpretable. Many
agencies,  contractors,  and  others  doing  research 
might  be  well  advised  to  save  their  money  and  do 
perhaps  one  or  two  sound  research  studies  rather 
than  five  or  six  poor  ones. 


Summary 

In  the  first  section  of  this  chapter,  the  roles, 
uses, and characteristics of evaluation were discussed.
Evaluation was differentiated from
research.  Formative  and  summative  types  of 
evaluation  were  discussed.  Also,  evaluation  was 
contrasted  with  appraisal  and  assessment.  It  was 
concluded  that  evaluation  is  a  process  which  is 
used  to  make  decisions  with  regard  to  instructional 
programs,  instructors,  students,  institutional 
planning,  administration,  and  costs. 

The  second  part  of  this  chapter  contained  a 
short discussion of objectives. Most of the sources
reviewed  seemed  to  indicate  that  each  unit  of 
training  must  have  a  behavioral  objective  based  on 
the  job  requirements. 

The  third  portion  of  this  chapter  contained  a 
systematic  overview  of  approaches  to  evaluation 
and  course  development.  These  systems 
approaches  to  evaluation  and  course  development 
attempt  to  account  for  almost  all  of  the  variables 
that  can  affect  training  and  student  behavior. 

The fourth segment of the chapter consisted of
a discussion of the measurement aspects of evaluation.
There was a presentation of the various types
of criteria that can be used in evaluation studies.
Emphasis was placed on the multidimensional
aspects of criterion measurement. Most of the
writers reviewed suggested that transfer of learning
was the ultimate goal of training. Also, sampling
procedures were suggested as a means of saving
time and costs when the units of measurement are
whole classes and schools.

The  final  section  of  this  chapter  presented  a 
discussion  of  the  various  problems  and  difficulties 
involved  in  evaluation  studies.  Several  conclusions 
were  drawn: 

1. There is too much use of rational rather than
empirical  methods. 

2.  There  is  too  much  subjectivity  when 
objectivity  is  needed. 

3.  Evaluation  research  is  too  often  limited  by 
monetary  considerations. 


III. QUANTITATIVE METHODS AND
DEPENDENT  MEASURES 

Characteristics  of  Dependent  Variables 

Fitzpatrick (1970) lists four characteristics of
criteria which he thinks are essential for any
evaluative measure. First, the criteria must be
relevant to the objectives being measured. Second,
the criteria must be comprehensive and cover all
important objectives. Third, the criteria must be
reliable within the limits of cost. Finally, the
criteria selected must be feasible, and this is determined
almost solely by cost.

Bloom  (1970)  also  makes  a  set  of  very  relevant 
comments  concerning  validity  with  regard  to 
student  measurement  and  training  evaluation. 
Generally,  content  validity  is  stressed  in  training 
evaluation,  while  construct  validity  is  emphasized 
in assessment and appraisal. Student measurement,
though, usually emphasizes predictive and concurrent
validity. Bloom feels that the type of validity needed
should be determined and not be confined to one
or another. Bond and Rigney (1970) add that the
dependent  measure  which  “best  predicts  final 
performance”  should  always  be  selected. 

Several indices may be related to final performance,
and the computer can be used to choose and
weight  them. 
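
One way such weighting could be done is ordinary least squares
regression of final performance on the indices. The source does
not say which method Bond and Rigney had in mind, so the choice
of multiple regression here is an assumption, and the function
name least_squares_weights is hypothetical. The sketch covers
only the weighting step, not the choice among indices.

```python
def least_squares_weights(indices, final_scores):
    """Return [intercept, w1, ..., wk] fitted by ordinary least squares.

    indices[s] holds the k index values for student s, and
    final_scores[s] is that student's final performance score.
    """
    rows = [[1.0] + [float(v) for v in r] for r in indices]
    k = len(rows[0])
    # Build the normal equations (A'A)w = A'y as an augmented matrix.
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, final_scores))]
        for i in range(k)
    ]
    # Solve by Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            raise ValueError("indices are collinear; drop one and retry")
        m[c], m[pivot] = m[pivot], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(k):
            if r != c:
                factor = m[r][c]
                m[r] = [a - factor * b for a, b in zip(m[r], m[c])]
    return [m[i][k] for i in range(k)]
```

An index whose fitted weight is near zero contributes little to
the prediction, which is one possible basis for dropping it.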

Gideonse  (1968)  lists  several  types  of  measures 
that  can  be  used  for  measuring  students  and  for 
training  evaluation.  Gideonse’s  measures  are  (a) 
student  achievement  as  measured  by  tests  (which 
leaves  many  of  the  student’s  intellectual  qualities 
untapped);  (b)  a  desirable  change  after  a  stimulus 
input;  (c)  dropout  or  attrition  rate;  (d)  attitudinal 
and motivational measures; (e) education levels;
and (f) facilities, equipment, materials, human
resources,  pupil  expenditure,  non-school  activities, 
organization  patterns,  and  administrative  agencies. 

Campbell and Dunnette (1968) add that most
T-group research involves the use of attitude scales
or opinion change as criteria rather than organizational
performance or improvement.

Crawford  (1967)  indicates  that  proficiency 
tests,  when  used  to  evaluate  training  programs, 
should  not  just  be  used  at  the  end  of  training,  but 
should  also  be  used  to  test  retention  after  a  period 
of disuse. Similarly, Martin (1957) divides criteria
into  those  based  on  the  content  of  the  training 
program  (internal  criteria)  and  those  based  upon 
job  behavior  (external  criteria). 

Englemann (1968) contends that there are two
kinds of conditions which can indicate that learning
has occurred. In the fixed condition, a
response or instance of behavior is used to show
that learning has taken place. This is the criterion
of performance. In the variable condition, several
responses can show that learning has occurred.
One can easily see that within this latter condition,
it is easier for the student to demonstrate that he
understands the concept being taught since the
requirement for learning in the variable condition
is dependent on a concept or rule and not on a
response. Englemann adds that both the fixed and
variable conditions are needed depending upon the
situation.

Kelley  and  Kelley  (1970)  document  a  unique 
type  of  dependent  measure  for  research  which 
holds  the  traditional  dependent  variables  of  speed 
and  accuracy  constant.  They  work  with  an 
“adaptive  variable”  which  is  the  adjustment  the 
student  must  make  to  obtain  a  certain  score  with 
speed  and  accuracy  held  constant.  The  adjustment 
is  the  dependent  variable,  and  it  can  be  any 
variable  which  affects  performance. 

Test  Construction 

Denova (1968) lists the steps in test construction
as follows: (a) defining test scope, (b)
defining what is measured, (c) choosing items, (d)
choosing the most appropriate testing technique,
(e) determining the number of items, (f) choosing
final items, (g) arranging items, (h) writing clearly
understandable directions, (i) constructing a
scoring template, and (j) evaluating questions.
Evaluation of the test, of course, involves such
factors as (a) validity, (b) reliability, (c) simplicity,
(d) distribution, (e) content, (f) objectivity, and
(g) difficulty level. Other, more exhaustive,
accounts of test construction and its concomitant
problems can be found in many sources such as
Air Force Manual 50-9 (1967), Gronlund (1968),
and Wood (1960). The remaining parts of this
chapter, therefore, are devoted to some new techniques
and applications.

Horn (1966) feels that a predictor test must
have internal consistency in order for it to correlate
adequately with a criterion. On the other
hand, he feels that assessment tests need representativeness
of content regardless of internal consistency.
He demonstrated that his own classroom
assessment  devices  were  more  like  predictors  than 
assessors.  Horn  concludes  that  there  is  no  reason 
why  assessment  devices  must  have  low  internal 
consistency  reliability. 

McGuire and Babbott (1967) constructed a test
for medical students consisting of a series of
simulation exercises. The test begins with a case
write-up and several possible courses of action or
diagnoses. Each choice the student makes is
branched to other choice points until the patient
either dies, is transferred, or gets well. In the construction
of the test, a panel of experts rated each
choice along a five-point scale which ranged from
“clearly indicated” to “clearly contra-indicated.”
Several possible scores result from the procedure.
The efficiency score is the percentage of the
student’s answers which are helpful to the patient.
The proficiency score is the percentage agreement
with the criterion group (optimal patient care).
Proficiency, then, is a combination of errors of
commission and errors of omission. The composite
score is a function of proficiency and efficiency.
According to McGuire and Babbott, traditional
multiple-choice tests take a portion of behavior
and treat it independently of the total behavior
pattern of which it is a part. This stresses
“product” as opposed to “process.” McGuire and
Babbott conclude that their test stresses the process
aspects of behavior and that it is uncorrelated
with most multiple-choice tests.
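
The two percentage scores can be sketched as follows. The exact
scoring rules used by McGuire and Babbott are not given here, so
this is an illustration under stated assumptions: each option is
classed by the expert panel as helpful or not, and agreement
with the criterion group is counted over every option, so that
both choosing a non-optimal option (commission) and skipping an
optimal one (omission) lower the score. The function names are
hypothetical, and the composite score is left out because its
formula is not stated.

```python
def efficiency(chosen, helpful):
    """Percentage of the student's chosen options that are helpful."""
    chosen = set(chosen)
    if not chosen:
        return 0.0
    return 100.0 * len(chosen & set(helpful)) / len(chosen)


def proficiency(chosen, optimal, all_options):
    """Percentage of options on which the student agrees with the
    criterion group, counting both selections and omissions."""
    chosen, optimal = set(chosen), set(optimal)
    agreements = sum((o in chosen) == (o in optimal) for o in all_options)
    return 100.0 * agreements / len(all_options)
```

A student who selects only a few helpful options can therefore
score high on efficiency and low on proficiency, which is why
the two scores are reported separately.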

Westbrook  and  Jones  (1968)  used  a  class  of 
psychology  graduate  students  to  construct  a 
multiple-choice  test  of  Anastasi’s  testing  book. 
There  were  54  items  in  form  A  and  54  items  in 
form  B.  The  Kuder-Richardson  reliability  was  .73 
and  the  split-half  reliability  was  .62.  The  tests  were 
validated  against  a  teacher-made  test,  resulting  in 
validities  of  .75  for  form  A  and  .59  for  form  B. 
Evidently, graduate students can be used to construct
fairly reliable and valid tests.
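
For readers unfamiliar with the two reliability indices, a
minimal sketch of each follows. It assumes that the
Kuder-Richardson figure refers to formula 20 and that the
split-half figure is an odd-even correlation stepped up with the
Spearman-Brown formula; the source does not state either detail.
Items are scored 0 or 1, and the function names are
hypothetical.

```python
from math import sqrt


def kr20(scores):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores."""
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n  # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / variance)


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(scores):
    """Odd-even split-half reliability with Spearman-Brown correction."""
    first = [sum(row[0::2]) for row in scores]
    second = [sum(row[1::2]) for row in scores]
    r = pearson(first, second)
    return 2 * r / (1 + r)
```

Both functions expect one row per examinee and one column per
item.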

Gorth and Grayson (1969) developed a Fortran
computer program which can “. . . compose and
print any number of tests consisting of questions,
multiple-choice or completion type, selected from
an item pool (p. 173).” This program will make as
many copies as is desired, randomize multiple-choice
answers, and print scoring keys. Apparently,
this program is for sale.
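
The general idea of composing a test from an item pool can be
sketched in a few lines. This is not the Gorth and Grayson
Fortran program, whose internals are not described here; the
item layout (dictionaries with "stem", "options", and "answer"
keys) and the function name compose_test are assumptions made
for illustration only.

```python
import random

LETTERS = "ABCDEFGH"


def compose_test(item_pool, n_items, seed):
    """Draw n_items from the pool, shuffle multiple-choice options,
    and return the test together with its scoring key."""
    rng = random.Random(seed)
    test, key = [], []
    for number, item in enumerate(rng.sample(item_pool, n_items), start=1):
        question = {"number": number, "stem": item["stem"]}
        if "options" in item:
            # Multiple-choice item: randomize the answer order.
            options = list(item["options"])
            rng.shuffle(options)
            question["options"] = options
            key.append(LETTERS[options.index(item["answer"])])
        else:
            # Completion item: the key is the expected answer itself.
            key.append(item["answer"])
        test.append(question)
    return test, key
```

Calling the function with different seeds yields different
forms of the test, each with its own key, while the same seed
reproduces the same form.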

Forrest (1970) wished to develop an objective
flight test for private pilot certification. His test
consists of a miniature sample of flying situations
typically met by pilots. Each situation involves an
evaluation and an action. The test measures (a)
retention and recall, (b) judgment, (c) planning
and problem solving, (d) perceptual-motor coordination,
and (e) habit. The actual test was a
cross-country flight with a pre-flight and an in-flight
phase (N = 15). Scores on the test correlated
.50 with expert ratings.

Hierarchical  and  Sequential  Testing 

Hierarchical and sequential tests involve a
sequence of branching in which the student only
gets items at his own level. This procedure
decreases testing time, increases reliability, and
increases student motivation because he is not
forced to take and guess at the more difficult
items. The concept was introduced in early
“intelligence tests” and has recently received new
emphasis. An example of the application is the
work of Cleary, Linn, and Rock (1968a, 1968b)
who wished to use programmed tests to decrease
testing time while leaving reliability and validity
the same. In the procedure described by Cleary,
Linn, and Rock, each student receives a different
set of items along a scale. Sequentially programmed
tests have a routing section which
branches the subject to the appropriate items and
a measurement section containing items of suitable
difficulty. The routing section can be used alone,
although these investigators used a combination.
These authors used the test scores of 4,885 11th
grade  students  on  the  School  and  College  Ability 
Tests  (SCAT)  and  Sequential  Tests  of  Educational 
Progress  (STEP).  The  sample  was  divided  in  half, 
with  the  second  half  used  for  cross-validation 
purposes.  The  subjects  in  the  initial  validation 
effort  were  routed  into  four  groups  using  four 
different  sequential  sampling  procedures.  One  of 
the  four  routing  methods,  the  sequential  method, 
produced  the  fewest  errors  of  classification  and 
the  highest  overall  correlation  with  the  total  SCAT 
and  STEP  test  scores.  The  sequential  method  uses 
fewer  items  for  those  easy  to  classify  and  more 
items  for  those  at  the  borderline  of  categories.  The 
measurement  test  is  constructed  by  obtaining  the 
items with the 20 highest within-group point-biserial
correlations (excluding the routing items).
Computer based testing could facilitate this
procedure because of speed, flexibility, convenience,
and immediacy of feedback. This
method  is  especially  suited  to  persons  at  the 
extremes  of  the  distribution  because  they  can  be 
quickly  routed  and  thus  save  time.  One  problem 
acknowledged  by  the  authors,  with  this  research 
effort,  is  that  the  SCAT  and  STEP  items  were 
taken  out  of  context  from  a  total  test.  This  could 
have  biased  the  results. 
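
The item selection step for the measurement section can be
illustrated with a short sketch. It assumes that each item is
scored 0 or 1 and is correlated with a criterion score (for
example, total test score) within a single routed group; the
function names and the choice of criterion are assumptions, not
details taken from Cleary, Linn, and Rock.

```python
from math import sqrt


def point_biserial(item, criterion):
    """Point-biserial correlation between a 0/1 item and a score."""
    n = len(item)
    passed = [c for i, c in zip(item, criterion) if i == 1]
    failed = [c for i, c in zip(item, criterion) if i == 0]
    if not passed or not failed:
        return 0.0  # item does not discriminate within this group
    mean = sum(criterion) / n
    sd = sqrt(sum((c - mean) ** 2 for c in criterion) / n)
    if sd == 0:
        return 0.0
    p = len(passed) / n
    gap = sum(passed) / len(passed) - sum(failed) / len(failed)
    return gap / sd * sqrt(p * (1 - p))


def top_items(scores, criterion, k):
    """Indices of the k items with the highest point-biserials."""
    n_items = len(scores[0])
    r = [point_biserial([row[j] for row in scores], criterion)
         for j in range(n_items)]
    return sorted(range(n_items), key=lambda j: r[j], reverse=True)[:k]
```

With k set to 20 and the routing items removed from the score
matrix beforehand, top_items mirrors the selection rule
described in the paragraph above.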

Lord  (1971a,  1971b)  introduces  a  theoretical 
treatment  of  “tailored  testing”  which  is  a 
sequential  testing  procedure  consisting  of  one 
rather than two stages. It is tailored in the sense
that the items are those that are best suited to the
individual being tested. “In tailored testing, we try
to choose items for administration that are at a
difficulty level that matches the examinee’s
ability, which we infer from his responses to the
items already administered. . . . when the
examinee gives a wrong answer to an item, the
next item administered should be an easier one;
when he gives a correct answer the next item
administered should be harder (Lord, 1971a,
pp. 3-4).” In his earlier work, Lord (1969) evolved
a  two-stage  testing  procedure  using  similar 
principles. 
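
The up-and-down rule quoted above can be illustrated with a short sketch. The numbered difficulty levels, the fixed step of one level, the default starting point, and the function name are illustrative assumptions; they are not Lord's actual procedure or scoring method.

```python
def tailored_sequence(responses, n_levels, start_level=None):
    """Return the difficulty levels administered under a simple up-and-down rule.

    responses: iterable of booleans (True = correct), one per item administered.
    n_levels: number of difficulty levels, 0 (easiest) to n_levels - 1 (hardest).
    start_level: hypothetical starting level; defaults to the middle of the scale.
    """
    level = n_levels // 2 if start_level is None else start_level
    administered = []
    for correct in responses:
        administered.append(level)
        # Correct answer -> harder item next; wrong answer -> easier item next.
        step = 1 if correct else -1
        # Stay inside the available range of difficulty levels.
        level = min(max(level + step, 0), n_levels - 1)
    return administered
```

With nine levels and the response pattern correct, correct, wrong, correct, the examinee starts at level 4 and is given levels 4, 5, 6, and 5 in turn.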

Ferguson  (1969)  used  a  computer  to  select 
items  on  the  basis  of  a  student’s  prior  responses. 
The computer will keep testing the student until
he satisfies the criterion specified by the training
objective. When the criterion is met, the computer
will route the subject to the next training objective
containing items based upon the student’s proficiency
on the first training objective. The program
was  successfully  used  with  75  elementary  school 
students  from  the  Pittsburgh  area. 

According  to  Gagne  (1967),  if  the  curriculum 
units  are  arranged  hierarchically,  and  the  test  items 
meet  standard  requirements,  a  hierarchical  testing 
procedure  will  be  implicit  since  most  people  who 
fail  the  lower  unit  will  not  pass  the  next  higher 
unit.  Moreover,  if  persons  who  pass  a  lower  unit 
fail on the next higher unit, an additional interspersed
unit may be indicated. Obviously, this
technique  can  also  indicate  whether  or  not  some 
units  have  been  reversed  in  the  hierarchy  of 
instruction. 
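
The reasoning in this paragraph amounts to reading a two-by-two tally of pass and fail outcomes on adjacent units. The sketch below is one hypothetical way to do so; the 25 percent thresholds and the function names are assumptions chosen for illustration and do not come from Gagne.

```python
from collections import Counter

def hierarchy_table(results):
    """Tally (lower_passed, higher_passed) outcomes for two adjacent units.

    results: iterable of (lower_passed, higher_passed) boolean pairs, one per student.
    """
    return Counter((bool(lo), bool(hi)) for lo, hi in results)

def interpret(table, threshold=0.25):
    """Give a rough reading of the tally; the threshold is illustrative only."""
    total = sum(table.values())
    if total == 0:
        return "no data"
    fail_then_pass = table[(False, True)] / total  # contradicts the assumed order
    pass_then_fail = table[(True, False)] / total  # suggests a gap between units
    if fail_then_pass > threshold:
        return "units may be reversed"
    if pass_then_fail > threshold:
        return "an interspersed unit may be needed"
    return "consistent with the hierarchy"
```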

Criterion-  and  Norm-Referenced  Testing 

Glaser (1963) and his colleagues (Glaser & Cox,
1968;  Glaser  &  Klaus,  1962;  Glaser  &  Nitko, 
1971),  as  well  as  Popham  (1969),  Carver  (1970), 
and  Holtzman  (1971),  have  all  written  on  the 
topic  of  criterion-referenced  versus  norm- 
referenced  testing.  The  characteristics  of  criterion- 
referenced tests are that they (a) indicate the
degree of competence attained by an individual
independent of the performance of others; (b)
measure student performance with regard to
specified absolute standards of performance; (c)
minimize individual differences; and (d) consider
variability  irrelevant. 

Generally,  from  these  statements,  it  can  be  seen 
that  criterion-referenced  tests  tell  how  the  student 
is  performing  with  regard  to  a  specified  standard 
of  behavior.  Individual  differences  are  considered 
irrelevant,  since  the  student  is  graded  against  a 
single  standard  rather  than  against  all  the  others 
taking  the  test.  Assigning  grades  of  competence  to 
students on the basis of relative performance,
when it is not really known whether any of the
students have attained a specified behavioral
objective, makes very little sense. One can, though,
derive  individual  differences  from  criterion- 
referenced  tests  by  specifying  the  degree  of 
competence  reached  by  each  student. 

Simon  (1969)  thinks  that  there  is  no  real 
difference  between  criterion-  and  norm-referenced 
tests.  Whether  a  test  is  one  or  the  other  depends 
upon how the scores are used.

Glaser  (1963)  and  Glaser  and  Cox  (1968) 
discuss the use of norm-referenced achievement
tests and criterion-referenced tests in differentiating
among individuals and treatment groups.
When  evaluating  individuals,  one  needs  to  use  an 
achievement  test  containing  items  with  different 
difficulty  levels.  For  evaluating  treatments  or 
experimental  conditions,  though,  one  needs 
perfect  post-treatment  answers  and  incorrect  pre- 
treatment  answers  so  that  the  dependent  measure 
is  maximally  sensitive  to  training  change.  In  this 
latter case, criterion-referenced tests are most
appropriate. 
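
One simple way to express this sensitivity requirement is the difference, for each item, between the proportion of students answering correctly after training and the proportion answering correctly before it. The index below is an illustrative sketch and is not a formula given by Glaser and Cox; the function name is hypothetical.

```python
def item_sensitivity(pre, post):
    """Proportion correct after training minus proportion correct before.

    pre, post: lists of 0/1 scores on one item for the same group of students.
    """
    if not pre or len(pre) != len(post):
        raise ValueError("pre and post must be non-empty and of equal length")
    return sum(post) / len(post) - sum(pre) / len(pre)
```

An item missed by every student before training and passed by every student afterward has an index of 1.0; an item whose proportion correct does not change has an index of 0.0.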

K. Johnson (1969a, 1969b) suggests that training
evaluation should use criterion-referenced
tests, but that they are costly and just not feasible
for many training situations. Johnson’s purpose
was to determine the degree to which other measures
(e.g., norm-referenced tests, student and instructor
attitudes) can be used as substitutes for criterion-
referenced tests. Reliabilities were calculated for
three measures on four courses taught at the Naval
Air Technical Training Center. In one course there
was a comparison with criterion-referenced tests.
The reliabilities for all three methods were fairly
high, but a large number of items was needed (i.e.,
more than 20) to get an adequate reliability for
norm-referenced tests. Student and instructor
attitudes were highly correlated, but neither had a
high correlation with norm-referenced tests. Each
of the three measures accounted for 27 to 43
percent of the variance of scores on criterion-
referenced tests. Without defining what he considered
to be an adequate substitute, Johnson
concluded that none of the other methods is an
adequate substitute for criterion-referenced tests.

Siegel, Schultz, and Lanterman (1964) and
Siegel and Fischl (1965) sought to develop a
criterion-referenced evaluation scheme for the
Navy electronics technician rating. What is unique
and interesting about these studies is that the
criterion referencing was done in combination
with Guttman scaling procedures. Their technique
involved (a) assembling statements of the specific
system objectives of Naval air electronics; (b)
weighting these objectives on the basis of the
importance of their respective contributions to
system requirements; and (c) psychophysically
establishing cut points on a Guttman-type job
performance scale, the cut points representing
levels of skill required in order for each of the
objectives to be met. The resultant Technical
Proficiency Checkout Form Scales (TPCF) were
found to correlate between .65 and .74 with
performance test scores.

Ratings

Rating,  although  widely  used,  is  one  of  the 
most  unreliable,  biased,  and  contaminated 
methods  for  evaluating  performance.  Several 
factors which can contribute to poor or inadequate
ratings are (a) friendship, (b) quick
guessing, (c) jumping to conclusions, (d) first-
impression responses, (e) appearance, (f)
prejudices, (g) halo effects, (h) errors of central
tendency, and (i) leniency. Of these, the last three
are  probably  the  most  important.  Halo  exists  when 
a  rater  allows  his  overall,  general  impression  of  a 
man  to  influence  his  judgment  of  each  separate 
trait  on  the  rating  scale.  Errors  of  leniency  occur 
when  a  rater  tends  to  use  only  the  upper  portion 
of  the  rating  scale  when  rating  all  or  most  of  his 
men.  Errors  of  central  tendency  occur  when  the 
rater  uses  only  the  middle  portion  of  the  rating 
scale  when  rating  his  men.  Considerable  evidence 
exists  which  demonstrates  that  rater  training  can 
reduce  these  sources  of  bias  so  that  the  resultant 
ratings  are  at  least  minimally  useful  (Bergman  & 
Kujawski,  1969). 
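
The three principal rating errors defined above can be made concrete with crude descriptive indices computed from one rater's ratings. The sketch below is illustrative only: the index definitions, the five-point scale, and the function name are assumptions and are not drawn from the studies cited.

```python
from statistics import mean, pstdev

def rater_summary(ratings, scale_min=1, scale_max=5):
    """Summarize one rater's ratings.

    ratings: list of rows, one per ratee; each row holds that ratee's trait ratings.
    Returns three crude indices:
      leniency - mean rating minus the scale midpoint (positive = lenient)
      spread   - standard deviation of all ratings (a small value with a mean
                 near the midpoint suggests an error of central tendency)
      halo     - mean within-ratee standard deviation across traits
                 (a small value means traits were rated alike, suggesting halo)
    """
    midpoint = (scale_min + scale_max) / 2
    flat = [r for row in ratings for r in row]
    return {
        "leniency": mean(flat) - midpoint,
        "spread": pstdev(flat),
        "halo": mean([pstdev(row) for row in ratings]),
    }
```

A rater who gives every trait of a given man the same rating produces a halo index of zero, and a rater who uses only the top of the scale produces a large positive leniency index.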

Howard and Correll (1966) wanted to determine
if there was a consensus with regard to the
acceptability of various behaviors of psychological
interns among those responsible for training them.
The trainers were given a list of 27 critical incident
statements and were asked to indicate whether the
behavior described in the incident was characteristic
of a beginning trainee, an intermediate
trainee, or a senior trainee. In many instances,
university based trainers used more lenient
standards, and in other instances agency trainers
used more lenient standards. There was, of course,
some agreement across universities and agencies.
Overall, some behaviors thought to be characteristic
of beginners in one place were thought to
be characteristic of senior trainees in another
place. The authors concluded that more
uniformity is needed because of the widely differing
standards of behavior.

In another study, Edwards (1968) had the
teachers from five nursing schools rate the performance
of 55 of their senior nursing school
students under three conditions: (a) situations
requiring interpersonal physical care; (b) situations
needing technical skills; and (c) conditions requiring
non-physical, interpersonal patient care.

Evaluations were made by the operating room
instructor, the medical nursing instructor, and the
psychiatric instructor. All trainees were rated from
A to E. The results showed that all interrater
correlations were very low (.5 at most). The only
fairly high correlations were within instructors
across specialties. The author indicates that these
unreliable results were caused by (a) teacher personality,
(b) relations with students, (c) differential
behavior of students, and (d) differential
teacher criteria. The ratings also had a disappointing
relationship with test scores and grades within
specialty. The ratings correlated -.01 to .27 with
test scores and .20 to .49 with grades.

Greer, Smith, and Hatfield (1967) constructed
a standard system of checkpilot helicopter evaluation
in order to overcome effects of the checkpilots’
proclivity to rate on the basis of their own
personal standards rather than on student flying
skill. First, the training program was evaluated in
terms of maneuver components. Specific proficiency
scales and instrument observation were
used as criteria instead of the checkpilot’s own
schema. From this early work the Pilot Performance
Description Record (PPDR) was constructed.
The PPDR consisted of items reflecting the most
critical aspects of each maneuver. Fifty intermediate
and 50 advanced helicopter students were
each given checkrides with one research staff
member and one checkpilot. Prior to this, some of
the checkpilots were trained in the use of the
PPDR to reduce checkpilot differences in scoring
standards. The results showed that (a) the reliability
of flight proficiency evaluations improved;
(b) the PPDR recorded specific student deficiencies;
(c) checkpilots trained in use of the PPDR
were more consistent in their evaluations than
checkpilots who were only oriented in the PPDR;
and (d) checkpilot training is necessary when using
the PPDR.

In another study, Greer (1968) wished to
increase the reliability of checkpilot ratings, which
typically averaged .20. Checkpilots were asked to
complete an 11-item rating form. Those who
agreed with an r of .90 or better were paired
together with students; the resultant correlation
was .65.

Duffy (1968), Duffy and Jolley (1968), and
Duffy and Anderson (1968) wished to develop an
objective recording device to score student
helicopter check rides. The students were scored
during and after training and on maneuvers. All
data were recorded on IBM cards, and a class
percentage error and a school average were
tabulated. If certain types of errors tended to
show up under one instructor in one aspect of
training, the instructor was given additional instructional
training. If one checkpilot was found
to be more strict than the other, he was also given
counsel to make his ratings less strict.

Caro (1968) undertook a study to compare
grades given by checkpilots and grades given by
instructors before and after innovations in rating
were introduced. A second study was performed
to determine if grades were influenced by the
checkpilot’s relationship with the students or the
instructors. To eliminate bias due to prior knowledge,
40 of 60 subjects were given checkrides by
checkpilots outside the classes studied. The
principal results of concern from these two studies
suggested that (a) there were high correlations
between instructors and checkpilots from the same
classes; (b) there was no relationship between instructors
and checkpilots from outside the classes;
(c) student grades were affected by the individual
standards of the checkpilot; (d) specific information
was collected by the checkpilot on the
student’s flight, but not systematically or consistently;
and (e) there were no differences after
the new grading procedures were introduced.

Jenkins, Ewart, and Carroll (1950) sought to
develop an index of combat effectiveness against
which tests could be validated. They used the
nomination technique which asks each man to
name two with whom he would like to fly wing
and two with whom he would not like to fly wing,
together with the reasons for his choices (checked
off on a 22-item checklist). Data were collected on
2,274 high, 1,829 low, and 228 mixed pilots.
The results showed that the nominations were
related to the rank of the officer and that their
reliability was .80. The reasons for the nominations
were more reliable for the lows than for the
highs. Also, there was a different frequency of use
of reasons for different ranks (e.g., senior officers
more often avoided going on combat missions than
junior officers). A factor analysis of the checklist
data delineated several underlying factors: (a)
sociability, (b) practical intelligence, (c) cool-
headedness, (d) combat aggressiveness, (e) flying
skill, (f) teamwork, (g) leadership (highs only), and
(h) reaction to failure (lows only). A second order
factor analysis resulted in two high factors
(fighting ability and capacity for combat leadership),
and three low factors (emotional
inadequacy, fear-impulsive foolish, and lack of
practical intelligence). All of the aforementioned
factors were orthogonal. Those interesting results
notwithstanding, the ratings failed to predict
combat success, even with rank controlled.

In another study, Yellen (1969) used co-worker
or  peer  ratings  as  criteria  of  performance  for  field 
artillery  crewmen.  The  multiple  correlation 
between  these  ratings  and  a  weighting  of  the  major 
areas  of  a  proficiency  test  was  .71. 


In one final study (Flaugher, Campbell, & Pike,
1969),  white  and  black  medical  technicians  were 
rated  on  job  performance  by  both  white  and  black 
supervisors.  White  supervisors  tended  to  rate  the 
whites  slightly  higher  than  the  blacks,  while  black 
supervisors  rated  blacks  considerably  higher  than 
whites. 

In summation, ratings tend to improve to the
extent that the rater’s own idiosyncrasies
are prevented from affecting his
observation of subordinate behavior. The evaluator
must observe and record behavior in objective
terms. If this suggestion seems mechanistic and
devoid of rater influence, it is meant to be that
way. The more the rater can become like a
behavioral metering device, the less likely he will
contaminate the evaluation. Also, it will help
immensely if the rating items are couched in
behavioral rather than in relative or evaluative
(e.g., above average) terms. Finally, performance
evaluations should not be tied to salary review
unless they are to be used for that purpose.

In general, ratings are much used and convenient
although they are at best a haphazard method
of  evaluating  training  performance,  student 
achievement,  or  job  behavior.  If  other,  more 
objective  methods  are  feasible,  they  should  be 
used. 

Cost  Effectiveness 

Alkin (1970) has written an extensive treatise
on cost-benefit analysis. Some of his comments
and suggestions are reviewed in the ensuing paragraphs.

Generally,  cost-benefit  analysis  is  the  analysis  of 
the  costs  and  benefits  of  various  alternative 
courses  of  action.  The  decision  maker  selects  the 
method  giving  the  largest  yield  at  a  given  cost,  or 
the  most  benefit  for  the  least  cost.  Input  and 
output must be measured in dollar terms. Cost-
benefit studies are usually large-scale. For instance,
the  costs  of  college  education  can  be  compared 
with  the  resultant  increase  in  productivity  yielded 
by  the  college  education. 
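
The two decision rules named in this paragraph (largest yield at a given cost, and a given benefit for the least cost) can be sketched as follows. The alternatives, dollar figures, and function names in the sketch and its usage note are hypothetical and serve only to illustrate the selection logic.

```python
def best_yield_within_budget(alternatives, budget):
    """Pick the alternative with the largest benefit among those affordable.

    alternatives: dict mapping name -> (cost, benefit), both in dollars.
    Returns the chosen name, or None if nothing fits the budget.
    """
    affordable = {k: v for k, v in alternatives.items() if v[0] <= budget}
    if not affordable:
        return None
    return max(affordable, key=lambda k: affordable[k][1])

def cheapest_meeting_target(alternatives, target_benefit):
    """Pick the least costly alternative whose benefit reaches the target."""
    adequate = {k: v for k, v in alternatives.items() if v[1] >= target_benefit}
    if not adequate:
        return None
    return min(adequate, key=lambda k: adequate[k][0])
```

For three hypothetical courses of action costing $100, $200, and $300 and yielding $150, $400, and $450, a budget of $250 selects the second under the first rule, and a required benefit of $400 also selects the second under the second rule.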

The manipulatable characteristics are the conditions
whose variations maximize or minimize
student output. The manipulatable characteristics
which affect student output are (a) student inputs
measuring the achievement starting point of the
student; (b) financial inputs or the funds allocated;
(c) the external system, which is the giver of inputs and
the receiver of outputs (e.g., society); and (d) instruction,
supplies, tests, and similar items.

With regard to the outcomes of cost-benefit
analysis, the analyst’s interest is in how the
student has changed in short- and long-term ways
(e.g., how well he deals with other schoolwork and
his society). Although there are financial inputs,
there are no financial outcomes except those
derived from behavior changes. There are also
non-student outcomes which comprise items such
as teacher salaries and number of personnel used in
the program.

Alkin  sees  three  major  problems  in  evaluating 
the cost effectiveness of manipulatable variables.
They include (a) difficulty in getting accurate cost
data; (b) difficulty “. . . in dealing with cost-
effectiveness estimates in the light of system-
interrelationships (p. 235);” and (c) problems in
generalizing  to  specific  individual  cases. 

Hawkridge (1970) says that there are two
evaluation loops regarding money allocated for
educational programs. These two loops are the
“philanthropic” and the “conservative.” As soon
as money is allocated, many programs spring up. If
the evaluation is done poorly or unreliably, then
the money is cut back and the first thing the
program administrator usually does is cut evaluation
cost so he can keep other aspects of the
program. One can, of course, stay in the philanthropic
loop if sound evaluation is performed.

Gubins (1970) performed a cost-benefit analysis
of training programs for the hard core
unemployed. In this case, cost-benefit analysis is
based on the cost of unemployment and the gain
from investment in these human resources.
Gubins’ findings suggested the impact of increasing
the number of hard core unemployed in government
training programs: (a) Programs were still
“economically efficient.” (b) There were greater
gains by trainees with less than nine years’ education
over trainees with greater than nine years’
education; therefore, the basic education portion
of training is of most value. (c) Training was more
beneficial for those less than 22 years of age than
for those greater than 22 years of age. (d) Trainees
gained financially after undergoing training.

S. Allison (1969) developed a cost-estimating
model for undergraduate pilot training. Inputs to
Allison’s model can include (a) undergraduate
pilot training graduation requirements,
(b) course requirements, (c) instructor-student
ratios, (d) administrative and support manpower
relationships, (e) number of aircraft and simulators
available, (f) quantity of facilities available, and (g)
cost relationships. The model, given the inputs,
computes the cost required for training in terms of
research and development costs, investment costs,
annual operating costs, and long-range feasibility
estimates.

The  Ozarks  Regional  Commission  presented  a 
rather  detailed  account  of  their  cost-effectiveness 
system (Manual, 1970). The goal of the commission
is closing the “income gap” between the
Ozark region and the rest of the nation. They
wanted to measure the additional value of
occupational education in the Ozark region. They
saw their major problem as transposing the gains
and losses into dollar terms. Benefits are calculated
in  terms  of  what  buyers  and  users  of  the 
commodity  will  pay,  or  in  terms  of  production 
costs  if  the  former  are  not  available.  Costs  consist 
of  the  value  of  the  goods  and  services  used  up  in 
the  project  as  compared  with  their  use  for  other 
purposes.  This  is  called  the  value  of  alternate  uses. 
If  no  alternate  use  exists,  the  costs  are  zero. 

Intangible  costs  and  benefits  cannot  be  put  into 
dollar  terms,  but  they  can  be  quantified  and 
compared  in  terms  of  alternate  courses  of  action. 
If, among two projects, A gives more net benefits
than B, but B has intangible benefits which override
the net benefits of A, then B might be chosen
as the course of action.


Some  of  the  Ozark  commission’s  cost- 
effectiveness  formulae  are  presented: 

1. cost benefit of the program =

   (program cost per student [operator illegible] tuition and books per student)
   [operator illegible] (annual income generated per student)
   [operator illegible] ((enrollees - dropouts) / enrollees)
   [operator illegible] (program length in months / student participation in program in months)

2. facility cost program =

   ((enrollees - dropouts) / enrollees) × (program length)
   × (space allocation in square feet) × (space cost / cost period)
   × (cost benefit of program)

3. cost benefit equipment =

   (1 / enrollees) × (length of time equipment available / time equipment used in months)
   × (equipment cost / period equipment usable (10 years)) × (cost benefit of program)

Gain Scores and Final Examination
Grades 

Carver (1966, 1969, 1970) presents a rather
conclusive argument against the use of gain or
difference scores in evaluation research. The problem
in the before-and-after measurement of gain
scores is that when small significant increases are
registered, there may actually be a tremendously
large increase in knowledge. This paradoxical
result comes from the inequality of measurement
at different points along the scale. Carver hypothesizes
that a curvilinear relationship exists between
test scores and knowledge, with knowledge
increasing faster than test scores. One can rarely
find a significant positive correlation between
initial test scores and gain scores (often there is an
inverse correlation). This is contrary to expectation,
since it is expected that the more intelligent
student will learn more and that the more interested
student will be motivated to study more.
One can partially explain this finding on the basis
that students who already know a lot do not have
much left to learn. Another related problem is the
ceiling effect which occurs when the initially
bright student already has most of the items on
the pretest correct and does not have much room
for improvement. Carver indicates that final
examination grades constitute a dependent variable
measure that is superior to gain scores, but
with certain restrictions: The ratio of final knowledge
to initial knowledge must be considerably
greater than one; the correlation between initial
knowledge and final knowledge must remain high;
and the variance of final knowledge must be
greater than the variance of initial knowledge.

Carver (1969) offers another solution, one
involving separation of the initially bright from the
initially dull students. This is done to correct a
motivation problem for the initially high scoring
student who has to waste time completing items at
a low level. It is possible that if the bright student
started off at a higher level, his gain may have been
greater. On this basis Carver concludes that final
scores are the best, because of unacceptable
solutions using functions of initial and final scores
and because expectations are not confirmed about
initially bright students. Guilford (1970), though,
feels that absolute scaling methodology might
offer a solution to this dilemma.

Bereiter (1963) presents certain other related
problems in the measurement of change:

1. The “overcorrection-undercorrection
dilemma” which occurs when there is a
negative correlation between the initial score
and the gain score. This can be corrected so
that a positive correlation can exist between
initial and gain scores.

2. The “unreliability-invalidity dilemma”
which occurs when there is a high correlation
between pretest and post test, thus
lowering the reliability of the difference
scores. If one obtains reliable difference
scores because of a low pretest-post test
correlation, there is less one can say about
the gain.

3. The “physicalism-subjectivism dilemma”
which involves the choice of the scale units
given versus units conforming to psychological
meaningfulness. Bereiter recommends
the use of terminal scores because change
scores create too many problems.
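
The second dilemma can be illustrated with the classical test theory expression for the reliability of a difference score when the pretest and post test have equal variances. The formula is a standard textbook result and is not quoted from Bereiter; the function and argument names are illustrative.

```python
def difference_score_reliability(r_pre, r_post, r_pre_post):
    """Classical reliability of a difference score, assuming equal variances.

    r_pre, r_post: reliabilities of the pretest and the post test.
    r_pre_post: correlation between pretest and post test.
    """
    if r_pre_post >= 1:
        raise ValueError("pretest/post test correlation must be below 1")
    # Average reliability minus the intercorrelation, rescaled by (1 - r).
    return ((r_pre + r_post) / 2 - r_pre_post) / (1 - r_pre_post)
```

With both tests at a reliability of .80, a pretest-post test correlation of .70 gives a difference-score reliability of about .33, whereas a correlation of .20 gives .75.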

Confidence  Testing  and  Partial 
Knowledge 

Shuford, Albert, and Massengill (1966) and
Shuford (1967) have constructed a scheme to
provide for more adequate measurement of
student knowledge than is possible with traditional
testing methods. They feel that additional information
is available from the student’s degree of
belief probabilities. A mathematical system is
presented which ensures that a student can maximize
his expected score if he truly reflects his
degree of belief or probability that a specific
response choice is correct. With the traditional
procedure, using a true-false test as an illustration,
the student assigns a different probability for each
response depending on his state of knowledge. If
the student sees the probability of true as being
greater than .50, he should choose true; but if the
probability is less than .50, he should choose false;
if it is equal to .50, he can choose either response.
Generally, a student with poor knowledge (p =
.51) will get the same score (if correct) as the
person with good knowledge (p = .90); therefore,
the choice situation loses data about the student’s
knowledge. In confidence testing, the student
receives a confidence score (a function of probability)
if his answer is correct plus a score for the
correct answer. In addition, the student can
receive credit if he is certain that his response is
incorrect and the response is, in fact, incorrect. In
one study (Massengill & Shuford, 1969), using
multiple-choice tests, confidence was divided
among the choices to total 1.0. The subjects for
this study were 26 college-level students. It was
found that the confidence ratings were highly
related to the probability of their answering the
questions correctly.
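
The property described above, that a student maximizes his expected score only by reporting his true degree of belief, can be demonstrated with a logarithmic scoring rule. The choice of the logarithmic rule, the grid search, and the function names are assumptions made for illustration; the source does not state which scoring function Shuford and his colleagues used.

```python
import math

def expected_log_score(reported, belief):
    """Expected score on a true-false item under a logarithmic scoring rule.

    reported: probability the student reports that the statement is true.
    belief: the student's actual degree of belief that it is true.
    """
    return belief * math.log(reported) + (1 - belief) * math.log(1 - reported)

def best_report(belief, steps=99):
    """Grid-search the reported probability that maximizes the expected score."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]  # 0.01 through 0.99
    return max(grid, key=lambda q: expected_log_score(q, belief))
```

Under this rule a student whose degree of belief is .90 does best by reporting .90, and a student whose degree of belief is .51 does best by reporting .51, so the two states of knowledge are no longer scored alike.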

Gardner (1970) administered a course pretest
using confidence estimates to 151 student instructors.
The test was designed to determine necessary
training for these instructors. Even with the
confidence scoring, there was no significant correlation
of the pretest with practice teaching or with
final class standing. The author still claims that
confidence testing yields a better assessment of
student knowledge, as well as higher reliability.

Coombs, Milholland, and Womer (1956)
present another method of assessing additional
student knowledge. Traditionally, in scoring a
four-choice multiple-choice question, a subject is
given a point for the correct answer and no points
for a choice of any incorrect answer or distractor.
Partial knowledge exists when the student can
identify one or more of the distractors. Using this
technique, in a multiple-choice format, one point
is given for each distractor identified and three
points are subtracted if the correct answer is
identified as a distractor. Scores on each four-
choice item can range from plus three to minus
three. Partial-knowledge testing, then, yields
increased item and test variance and penalizes for
random guessing. Two possible disadvantages of
this method are that it is not applicable to all
kinds of tests (e.g., true-false tests), and the
scoring is time-consuming.
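
The scoring rule just described for a four-choice item can be written out directly. The function name and the representation of a response as a set of crossed-out options are illustrative assumptions.

```python
def partial_knowledge_score(crossed_out, correct):
    """Score one four-choice item under the partial-knowledge rule.

    crossed_out: set of options the student marked as distractors.
    correct: the keyed (correct) option.
    One point is given for each true distractor crossed out, and three
    points are subtracted if the correct answer is crossed out.
    """
    distractors_identified = len(set(crossed_out) - {correct})
    penalty = 3 if correct in crossed_out else 0
    return distractors_identified - penalty
```

A student who crosses out all three distractors scores plus three, a student who crosses out only the keyed answer scores minus three, and a student who crosses out nothing scores zero.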

Characteristics  of  Material  to  be 
Learned 

R.  Allison  (1960)  gave  13  different  learning 
tasks  to  31 5  enlisted  men  at  a  United  States  Naval 
Training  Center.  Thirty-nine  aptitude  and  achieve¬ 
ment  mcasuics  were  also  administered.  Rate, 
curvature,  and  speed  during  the  first  and  second 
half  of  the  task  were  used  as  criteria  of  learning. 
Using  factor  analytic  techniques,  Allison  found 
that  learning  was  organized  in  a  multidimensional 
way.  Therefore,  he  contended  that  learning  is  not 
a  single  trait,  but  contains  several  factors 
depending “. . . upon the psychological process involved
in the learning task and the content of the
material to be learned (p. iii).” Also, the aptitude
and achievement measures had much in common
with the learning measures, demonstrating that the
ability to apply knowledge and the acquiring of
knowledge are very similar.

Naylor,  Briggs,  and  Reed  (1968)  found  that  a 
primary task (tracking) is performed better in
conjunction with a coherent or meaningful
secondary task (monitoring) than in conjunction
with a less meaningful or coherent task. Therefore,
secondary  task  coherence  can  affect  primary  task 
performance  in  dual  learning  situations. 

Weitz (1962, 1964) determined that with
different difficulties of independent variables (e.g.,
amount of information given in a training task),
the maximal effect on transfer of training will
occur either early or late during the trials. For easy
information  the  maximal  effect  occurs  early  and 
for  difficult  information  the  maximal  effect  occurs 
late. 

Underwood (1969, 1970) performed several
learning experiments which demonstrate a breakdown
of the total-time law, which states that the
amount learned is a function of total study time.
Eleven experiments were performed, each varying
the frequency of massed and distributed practice.
The results showed that (a) recall of distributed
practice was always greater than recall of massed
practice; (b) massed practice words which were
presented with exactly the same frequency as distributed
practice words were judged to have been
presented less frequently; and (c) the difference
(in recall) between massed and distributed practice
increased as the frequency of repetition increased.
Underwood  hypothesizes  that  the  difference 
between  massed  and  distributed  practice  could  be 
due  to  a  failure  of  reception  under  massed  practice 
which  resulted  in  learning  as  if  under  a  less 
frequent  rate  of  presentation. 

Jensen  (1971)  gave  two  groups  of  high  school 
students  equivalent  forms  of  a  visual  and  auditory 
digit span test. Both forms were administered to
both groups in a counterbalanced order under
immediate and 10-second delayed recall conditions.
Jensen found that auditory memory was
better than visual memory for immediate recall,
but that the reverse was found for the 10-second
delay condition.

Rather  than  viewing  instruction  as  merely 
presentation  of  information,  Whitmore  (1970a) 
feels  that  it  is  a  way  of  controlling  student 
behavior  so  that  learning  takes  place.  Some  factors 
which affect verbal learning are (a) attention span,
(b) organization of the material into meaningful
units, and (c) sequencing of material (e.g., hierarchical,
whole -> part, and general -> specific).

Carkhuff (1969) concluded relative to counsellor
training that “. . . those programs in which
high-level functioning trainers focus explicitly
upon dimensions relevant to helpee gains and make
systematic employment of all significant sources
of learning, including, in particular, modeling, are
most effective (p. 244).”

Composition Scoring

Fostvedt (1965) constructed several criteria for
the evaluation of high school English compositions
in order to correct for non-uniformity of evaluation
standards across teachers. Several sources
were used to formulate the criteria: (a) coherence
and logic, (b) development of ideas, (c) diction,
(d) organization, and (e) emphasis. A sample of
college English experts (N = 9) ranked these criteria.
Kendall’s coefficient of concordance was .75
(p < .01), indicating agreement among the experts
as to the importance of each criterion. Next, 30
English teachers were asked to grade 20 themes as
“above average,” “average,” or “below average”
on each criterion. Analysis of variance was used to
test criterion reliability, and the result was not
statistically significant (p > .05); therefore, different
teachers graded the same themes differently.
Chi-square  tests  also  demonstrated  no  agreement; 
hence,  the  criteria  were  not  reliable  when  used  for 
grading  purposes. 
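
For readers unfamiliar with the coefficient of concordance used in the ranking step, a minimal sketch of its computation follows. It assumes untied ranks and uses the standard formula W = 12S / (m^2 (n^3 - n)), where m is the number of judges, n the number of objects ranked, and S the sum of squared deviations of the rank sums from their mean. The rankings shown are hypothetical and are not Fostvedt's data.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied ranks.

    rankings: list of lists; each inner list holds one judge's ranks
    (1..n) for the same n objects, given in the same object order.
    """
    m = len(rankings)       # number of judges
    n = len(rankings[0])    # number of objects ranked
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    # S: squared deviations of the rank sums from their expected mean.
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical example: three judges ranking five criteria.
example_rankings = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 4, 5],
    [1, 3, 2, 5, 4],
]
w = kendalls_w(example_rankings)
```

W is 1 when all judges give identical rankings and falls toward 0 as agreement disappears.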

Bushan  and  Ginther  (1968)  feel  that  there  is  a 
good  deal  of  personal  bias  in  grading  essays  and 
that a more objective method is needed. Differentiating
between essays should take into account
“. . . the structure and length of the sentence,
vocabulary, and length as well as sociological and
psychological construct of the test (p. 417).” A
computer program was used which read off and
quantified several relevant, scorable variables on
11 University of Chicago essays which were also
graded  by  three  experts.  The  three  best  and  three 
worst  essays  were  then  coded  for  the  computer 
and  so  analyzed.  Thirteen  criteria  were  employed 
to  determine  differences.  After  the  differences 
were ascertained, these were used on the remaining
five  essays.  Overall  results  demonstrated  that 
better  essay  writers  (a)  have  a  larger  vocabulary; 
(b) include statements of other authorities who are
named; (c) give exact dates for events; (d) use
numbers for quantities; and (e) use fewer words
from  psychological  categories  that  can  be  analyzed 
for  personality  differences. 

Testing 

Much  of  the  previous  discussion  in  this  chapter 
has  been  concerned  with  various  applications  of 
testing.  In  this  section,  testing  in  the  pure  sense  is 
discussed. 

Paper-and-pencil tests, as the name implies, are
tests  which  the  examinee  takes  with  a  printed  test 
and  a  pencil.  Most  tests  of  this  type  require  at  least 
some  reading  ability.  Some  types  of  paper-and- 
pencil tests, though, require no reading ability at
all. Many perceptual speed and perceptual motor
tests are available on the market. Users of perceptual
tests feel that they are related to some
performance  aspects  of  jobs.  The  verbal  type  of 
paper-and-pencil  test  should  be  used  only  in  jobs 
which  are  primarily  verbal  or  cognitive  in  content. 
It  would  probably  be  inappropriate  to  give  a 
paper-and-pencil intelligence test or a vocabulary
test  to  a  person  applying  for  a  mechanical  trade. 
Such  tests,  however,  would  be  appropriate  for 
some  clerical  positions.  In  performance  tests 
(Danzig  &  Keenan,  1956;  Fiske,  1954),  the  trainee 
or  employee  is  asked  to  perform  some  tasks  in 
which  the  content  is  relevant  to  his  present  or 
future job. Some performance tests are less
obviously related to jobs than others. Performance
tests can range from dominoes, mazes, and puzzles
to performance of job tasks using real job equipment.
Perhaps the most sophisticated type of
performance test is the proving ground. In the
proving ground (McSheehy, 1959), the trainee is
placed on the job. An attempt is made to cycle
him through all the job tasks in a short period of
time. As he performs each task, the trainee is
evaluated and he, in turn, evaluates the training in
relation to the job.

Statistical  Methods 

There are a number of little-used and less
understood quantitative methods which can be
useful for training evaluation and student achievement
measurement.

Partial  Correlation  and  Part 
Correlation 

Partial correlation, according to DuBois (1957),
is “. . . the Pearson product-moment correlation
between two sets of residuals, from both of which
variance associated with the same set of independent
variates has been eliminated (p. 192).” In
actual practice partial correlation is used to hold
one or more extraneous or contaminating variables
constant. For example, in calculating the correlation
between height and weight, one might wish
to hold age and sex constant. Part correlation, on
the other hand, is “defined as the Pearson
product-moment correlation between a set of
residuals on one hand and an unmodified variable
on the other. . . . In studies of learning, for
example, it may be pertinent to inquire into the
degree to which final standing in some skill, less
the variance related with initial standing, is related
to some outside predictor variable (p. 60).” The
use  of  this  statistic  (part  correlation)  will  help  to 
clarify  some  of  the  problems  associated  with  the 
use  of  raw  gain  scores  mentioned  earlier  in  this 
chapter. 
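
The distinction between the two statistics can be illustrated with the standard first-order formulas, which work directly from the three pairwise correlations. The sketch below is illustrative only: the function names and the correlation values are hypothetical, with x standing for final standing, z for initial standing, and y for an outside predictor.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with z removed from BOTH variables."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def part_r(r_xy, r_xz, r_yz):
    """Correlation of y with x after z is removed from x ONLY
    (y is left unmodified)."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_xz ** 2)


# Hypothetical correlations: final standing (x), initial standing (z),
# outside predictor (y).
r_final_predictor = 0.60
r_final_initial = 0.50
r_predictor_initial = 0.40

partial_value = partial_r(r_final_predictor, r_final_initial, r_predictor_initial)
part_value = part_r(r_final_predictor, r_final_initial, r_predictor_initial)
```

With these made-up values the partial correlation is about .50 and the part correlation about .46. In absolute value the part correlation never exceeds the partial correlation, because only one of the two variables has been residualized.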

Factor Analysis

Factor analysis is simply a statistical method
for eliminating the redundancy present in correlation
matrices. One might, for example, be able to
reduce a 20 by 20 correlation matrix to a 20 by 5
factor  matrix,  thus  using  only  five  factors  rather 
than  20  items  to  describe  the  matrix. 

Obviously,  factor  analysis  can  be  a  useful  tool 
in  training  evaluation  and  student  achievement 
measurement. For example, one might have a 15-item
rating scale which measures on-the-job
behavior  of  training  school  graduates.  It  would  be 
inappropriate  to  describe  the  on-the-job  behavior 
of  these  men  in  terms  of  either  15  separate 
dimensions  or  one  overall  composite  when  the 
15-item  rating  scale  might  be  reduced  to  three  or 
four  dimensions  which  more  parsimoniously 
describe  on-the-job  behavior.  If  predictor  tests 
were  used,  then,  significant  validity  coefficients 
might  be  dependent  upon  whether  or  not  one  used 
factor  analysis.  Bergman  (1970)  had  such  an 
experience  when  attempting  to  predict  the 
behavior  of  139  oil  company  salesmen. 

Another  old  technique,  but  one  which  will 
probably  be  used  more  frequently  during  the  next 
decade,  is  Q-factor  analysis.  In  performing  a 
Q-factor  analysis,  one  simply  factor  analyzes  the 
matrix  of  person  correlations  rather  than  item 
correlations. This method can be useful for
grouping persons who think or behave similarly.
For example, when constructing a training program,
it may be useful to know the different
cognitive  styles  of  the  potential  trainees  so  that 
the  training  can  be  adapted  to  the  needs  of  each 
homogeneous  group.  Eddy,  Glad,  and  Wilkins 
(1967)  used  Q-factor  analysis  and  found  that  their 
training  program  differentially  affected 
“.  .  .  .students  depending  upon  their  own  goals, 
attitudes,  and  characteristics  and  of  their  work 
environments  (p.  23).” 

Tucker  (1966)  recently  presented  a  rather 
unique  application  of  factor  analysis  to  the 
measurement  of  student  learning.  His  innovation, 
though, has undeservedly been ignored by all but a
few members of the behavioral science community.
Using the Eckart-Young theorem (a
fundamental matrix decomposition theorem of
factor analysis), Tucker found that individuals
learn in qualitatively different ways over trials
such that individuals can be grouped or clustered
according to the way they perform or learn.
Tucker would not use a single, homogeneous learning
curve to describe what is, in fact, a heterogeneous
phenomenon.


Canonical  Correlation 

Canonical  correlation  is  an  extension  of  factor 
analysis  to  the  situation  in  which  two  separate  sets 
of variables exist. The first canonical correlation is
the highest correlation between a principal component
of the first set of variables with a principal
component of the second set of variables. The
second canonical correlation is the correlation
between a second principal component of the first
set of variables with a second principal component
of the second set of variables. Canonical correlations
are continually extracted until all the
common variance between both sets of variables is
accounted for. The method is most applicable
when there are two separate sets of variables: for
example,  one  set  of  predictor  variables  and  one  set 
of  criterion  variables. 

Moderator  Variables 

A test is a moderator when its score differentially
determines the predictability of another test
or  variable.  For  example,  one  may  be  able  to 
adequately  predict  the  performance  of  college 
students  using  an  intelligence  test  for  those  who 
score  high  on  a  test  of  achievement  motivation, 
but  not  for  those  who  score  low  on  the  test  of 
achievement  motivation.  Race  is  one  of  the  more 
currently  popular  moderator  variables.  Much 
recent  research  has  shown  that  employment  tests 
are differentially predictive across racial groups,
thus supporting the contention that common
selection standards for Negroes and whites are
inappropriate or unfair. Moderator variables are
sufficiently  important  to  student  achievement 
measurement  and  training  evaluation  that  they  are 
given  separate  treatment  in  another  chapter  of  this 
review. 
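
The achievement-motivation example above amounts to computing the predictor-criterion correlation separately within each level of the moderator. A minimal sketch follows. All data are fabricated for illustration, and the variable names are hypothetical. In practice the difference between subgroup correlations would also need a significance test and cross validation.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical intelligence test scores and criterion scores, split by a
# moderator (high vs. low score on an achievement motivation test).
high_motivation = {"iq": [95, 100, 105, 110, 115], "criterion": [2, 4, 5, 8, 10]}
low_motivation = {"iq": [95, 100, 105, 110, 115], "criterion": [3, 1, 4, 1, 3]}

r_high = pearson_r(high_motivation["iq"], high_motivation["criterion"])
r_low = pearson_r(low_motivation["iq"], low_motivation["criterion"])
```

In this fabricated case the predictor is highly valid for the high-motivation subgroup and essentially useless for the low-motivation subgroup, which is the pattern that defines a moderator.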

Convergent  and  Discriminant  Validity 

Campbell and Fiske (1959) would define convergent
validity as a high correlation between tests
purporting to measure the same thing, while discriminant
validity would refer to independence of
tests measuring different factors. The one criterion
for convergent validity is that the correlations
between several tests measuring one trait must be
significantly greater than zero (mono-trait hetero-method
correlation). For discriminant validity,
three criteria must be met: (a) the single-trait-multimethod
correlations must be significantly
greater than the correlations not having trait or
method in common; (b) the single-trait-multimethod
correlation should be significantly
higher than different traits measured by the same
method; and (c) there should be a stable pattern of
trait  interrelationship  regardless  of  the  method 
used. 

Campbell  and  Fiske  advocate  the  use  of  a 
multitrait-multimethod matrix which is in reality
confusing  and  unnecessary,  since  all  that  is 
required  is  understanding  of  the  concepts  involved. 
Dielman  and  Wilson  (1970)  and  Kavanagh, 
MacKinney,  and  Wolins  (1971)  are  among  those 
who  have  successfully  applied  this  technique. 

Internal  and  External  Validity 

Campbell  and  Stanley  (1963)  define  internal 
validity  as  “significance,”  and  external  validity  as 
measured  change  in  job  behavior.  Campbell, 
Dunnette,  Lawler,  and  Weick  (1970)  indicate  that 
internal  criteria  are  those  that  are  directly  tied  to 
training behavior and that external criteria measure
subsequent change in job behavior.

Campbell  and  Stanley  (1963)  and  Winch  and 
Campbell  (1969)  provide  an  exhaustive  list  of 
“threats”  to  internal  and  external  validity.  The 
threats to internal and external validity are (a)
history or antecedents, (b) maturation of subjects,
(c) testing effects, (d) instrumentation, (e) statistical
regression (extreme scores), (f) differential
selection of comparison groups, (g) experimental
mortality, (h) selection-maturation interaction, (i)
pretest sensitization, (j) interaction between selection
bias and the experimental variable, (k)
instability and unreliability of measures, (l) conditions
making the experimental setting atypical or
artificial, (m) multi-treatment interference, (n)
irrelevant components of complex measures, (o)
failure to replicate entire relevant parts of the
experiment, (p) effects of experimental arrangements,
and (q) effects of prior treatments. These
writers  recommend  the  use  of  experimental 
designs  and  statistical  treatments  which  minimize 
the  effects  of  these  variables. 

To assess effects of training, Campbell,
Dunnette, Lawler, and Weick (1970) recommended
using the following experimental paradigm:

Subject      Pre-measure   Training   Post-measure
Group I      Yes           Yes        Yes
Group II     Yes           Placebo    Yes
Group III    Yes           No         Yes
Group IV     No            No         Yes

In this design, the placebo group is necessary
because the measurable effects of training can be
attributed to the “Hawthorne effect.” The post-test
group (IV) is needed to avoid the possible
effects of pretest sensitization.

Scaling  Techniques 

Siegel  and  Schultz  (1960),  Siegel,  Schultz,  and 
Benson  (1960),  and  Schultz  and  Siegel  (1961a, 
1961b) report the use of scaled behavioral checklists
to evaluate job performance in several Naval
job  specialties.  These  lists,  developed  on  the  basis 
of  Thurstone  and  Guttman  scaling  principles, 
allow  one  to  evaluate  a  man’s  proficiency  by 
checking  just  one  task  on  a  list.  If  he  can  perform 
that  task,  it  can  be  assumed  that  he  can  perform 
all  tasks  below  that  level  on  the  scale. 
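
The cumulative logic of such a checklist can be sketched as follows. The task names and the function `inferred_proficiency` are hypothetical and are not taken from the cited reports. The sketch assumes a perfectly cumulative (Guttman-type) scale, which real checklists only approximate.

```python
def inferred_proficiency(scaled_tasks, hardest_task_passed):
    """Return every task a man is assumed able to perform.

    scaled_tasks: task names ordered from easiest to hardest.
    hardest_task_passed: the single task checked on the list.
    """
    cutoff = scaled_tasks.index(hardest_task_passed)
    # On a cumulative scale, passing a task implies passing all easier ones.
    return scaled_tasks[: cutoff + 1]


# Hypothetical scaled checklist, easiest task first.
checklist = [
    "identify components",
    "use test equipment",
    "trace a circuit",
    "isolate a malfunction",
    "repair and realign",
]
assumed_tasks = inferred_proficiency(checklist, "trace a circuit")
```

Here checking the single task "trace a circuit" credits the man with the first three tasks on the hypothetical list.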

Stone  and  Sinnett  (1968)  sought  to  determine 
whether  or  not  the  four-point  grade  point  average 
distribution  can  be  represented  as  being  an  equal 
interval scale. Thirty-six members of the University
of North Dakota were used as judges. The
grade range of A to F was divided into 12
intervals, e.g., F to F+, F+ to D-, D- to D, D to
D+, . . . , A- to A. The judges were then asked
to choose the grade intervals they thought were
larger. They used the paired-comparison technique
to rank all intervals. The median coefficient of
consistency for all judges was .83. A scale was then
constructed  using  Thurstone  techniques.  The 
results  of  this  scaling  analysis  were  that  (a)  the 
judged  scale  was  found  to  be  a  logarithmic  scale 
which  could  be  compared  to  the  grade  point 
average  scale;  (b)  generally,  the  intervals  were 
judged  to  be  smaller  as  the  grade  levels  decreased; 
(c) the midpoint of the scale was between C+ and
B-; (d) the distance between the midpoint of the
grade to the (+) point appeared larger than from
the (-) point to the midpoint; and (e) intervals
containing a grade boundary were judged larger
than those within a grade (e.g., C+ to B- was
thought greater than C to C+).

Schultz and Siegel (1962a, 1962b) used multidimensional
scaling analysis, which integrates
psychophysical judgments and factor analysis. The
procedure  is  “  .  .  .  obtaining  a  matrix  of  inter- 
stimulus  distances  (psychophysical  judgments) 
and  .  .  .  determining  the  dimensionality  of  the 
space  containing  the  stimulus  points  (p.  3).”  This 
method recognizes the multidimensionality, as
opposed to the unidimensionality, of job performance
criteria. Eighteen tasks performed by the
avionics electronics technician were delineated.
Judges were then required to indicate, along a
scale, the distance or similarity between all
possible pairs of tasks. After the analysis was
completed, four job dimensions were found: (a)
electro-comprehension, (b) equipment operation
and inspection, (c) electro-repair, and (d) electro-safety.
Schultz and Siegel (1964) then used these
four dimensions to construct unidimensional scales
via Thurstone and Guttman techniques. Siegel and
Schultz (1963) and Schultz and Siegel (1963) also
applied multidimensional scaling analysis to classification
of circuit types and to the Naval aviation
electronics technician supervisor rating.

Signal  Detection 

Siegel and Pfeiffer (1969) and Siegel, Fischl,
and Pfeiffer (1968) were successfully able to apply
signal detection theory to the prediction of
academic success in both a military and a college
setting. Signal detection theory “. . . provides a
way of controlling and measuring the criterion the
observer uses in making decisions about signal
existence and provides a measure of the observer
detection sensitivity (d') that is independent of his
decision criterion (p. 145).” Eighteen subjects in
Naval electronics training were divided into
journeyman, intermediate, and advanced levels of
training. Also, 40 male college sophomores were
divided into high grade point average (2.88) and
low grade point average (1.67) groups. The college
sample was given a 49-item (psychology) true-false
test, and the military sample was given a 23-item
(circuitry) test. Items that are answered true are
considered signal while items answered false are
considered noise. A sensitive observer is one who
differentiates with few errors between signal and
noise. The results of this study were that (a) d'
was 2.16 for the high grade point average students,
and 1.58 for the low grade point average students;
(b) Naval technicians with the least training and
experience had a d' of .64, while those with the
most training and experience had a d' of 3.20; (c)
analysis of variance results were significant for
both groups at p < .01; (d) Scholastic Aptitude
Test (SAT) scores were related to the college
sample grade point averages; (e) other academic
predictors did not correlate significantly with d',
suggesting that it measures a different basic
process; (f) SAT scores accounted for 16 percent
of the high grade point average variance and 13
percent of the low grade point average variance,
but with the addition of d', the predictable variance
increased to 33 percent and 51 percent,
respectively; and (g) the variance accounted for by
the military tests was 11 percent, but it increased
to 50 percent with the addition of d'. The authors
conclude that d' can be used both as a predictor of
performance and as a measure of training success.
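
As an illustration of how a sensitivity index of this kind is commonly obtained, the sketch below computes d' as the difference between the normal deviates of the hit rate and the false alarm rate. This is the textbook equal-variance Gaussian model and is an assumption here, since the review does not state which computational procedure the authors used. The function name and the rates are hypothetical. A "hit" is taken to mean answering "true" to a true item, and a "false alarm" answering "true" to a false item.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index under the equal-variance Gaussian model.

    Both rates must lie strictly between 0 and 1; rates of exactly
    0 or 1 need a correction that is not shown in this sketch.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical examinees on a true-false test.
discriminating = d_prime(0.90, 0.20)  # mostly hits, few false alarms
guessing = d_prime(0.50, 0.50)        # cannot tell true items from false
```

An examinee who says "true" equally often to true and false items earns a d' of zero regardless of how often he says "true", which is the sense in which d' is independent of the decision criterion.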

The  theory  of  signal  detection  bears  an  obvious 
relationship to the previously mentioned concept
of confidence testing. Test scores based on
confidence testing should correlate higher with
signal detection variables (d') than with traditional
test scores. Indeed, several investigators (Clarke,
1964; Pollack & Decker, 1964) have used confidence
estimates in their signal detection studies.
Signal  detection,  multidimensional  scaling,  and 
confidence  testing  all  derive  from  experiments 
based  upon  psychophysical  principles  which  are 
discussed  in  the  next  section. 

Psychophysics 

Siegel and Federman (1970) combined the
magnitude estimation technique with peer group
ratings to arrive at a novel method of performance
evaluation. The subjects for this experiment (N =
20) were two groups of 10 avionics technicians.
Each man was asked to estimate the number of
uncommonly ineffective and uncommonly
effective performances across nine performance
dimensions for the nine other men over a specified
period of time. The ratio of the number of uncommonly
effective (UE) performances divided by the
number of uncommonly effective performances
plus the number of uncommonly ineffective (UI)
performances (ΣUE / (ΣUE + ΣUI)) yields an index
which varies between zero and one. One of the
two groups was more experienced than the other,
and this technique was able to differentiate
between them.
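
The index itself is simple arithmetic, sketched below. The function name and the counts are hypothetical. The sketch assumes that the estimates for one man are summed over raters and performance dimensions before the ratio is taken.

```python
def effectiveness_index(ue_estimates, ui_estimates):
    """Ratio of summed UE estimates to summed UE plus UI estimates.

    ue_estimates, ui_estimates: peer estimates of uncommonly effective
    and uncommonly ineffective performances for one man.
    Raises ZeroDivisionError if no performances of either kind are reported.
    """
    total_ue = sum(ue_estimates)
    total_ui = sum(ui_estimates)
    return total_ue / (total_ue + total_ui)


# Hypothetical estimates from three peers for one technician.
index = effectiveness_index([3, 2, 1], [1, 1, 0])
```

With these made-up counts the index is 6 / (6 + 2) = .75. Values near one indicate that nearly all of the uncommon performances reported were effective ones.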

In addition to the aforementioned study, Siegel
and his associates at Applied Psychological Services
have over the years applied the classical
psychophysical  methods  to  several  other  aspects  of 
military  and  performance  evaluation.  Terminal 
threshold  concepts  were  applied  to  electronics 
troubleshooting  performance  evaluation  (Siegel, 
1968).  Psychophysical  methods  were  used  to 
maximize  the  probability  of  operator  malfunction 
recognition (Miehle & Siegel, 1967). Activity
circuit  interactions  were  related  to  perceived 
circuit  complexity  (Pfeiffer  &  Siegel,  1967b). 
Magnitude  estimation  and  the  structure  of  intellect 
model were used to relate electronics maintenance
job  activities  and  the  intellective  scale  values  of 
these  activities  (Pfeiffer  &  Siegel,  1967a).  The 
psychological  relationship  between  perceived 
circuit  complexity  and  a  physical  measure  of 
circuit  complexity  was  ascertained  (Pfeiffer  & 
Siegel,  1966).  Magnitude  estimates  of  perceived 
circuit  complexity  were  related  to  subjective  and 
objective  job  correlates  (Siegel  &  Pfeiffer,  1966b). 
Magnitude estimation was used to measure
avionics maintenance personnel subsystem reliability
(Siegel & Pfeiffer, 1966a). And, finally,
magnitude and category psychophysical scaling
methods  were  used  by  journeyman  electronics 
personnel  to  scale  the  complexity  of  various 
aspects  of  their  own  jobs  (Pfeiffer  &  Siegel,  1965). 

Summary 

The  first  section  of  this  chapter  presented  an 
overview  of  some  of  the  kinds  and  characteristics 
of  dependent  measures  used  in  training  evaluation 
and  student  achievement  measurement.  The  test 
construction  portion  of  this  chapter  contained  a 
brief  discussion  of  the  steps  to  be  followed  in 
constructing  a  test  plus  some  studies  using  novel 
tests  or  testing  techniques.  Other  topics  reviewed 
in this chapter were (a) hierarchical and sequential
testing, (b) criterion- and norm-referenced testing,
(c) performance evaluation problems, (d) cost
effectiveness, (e) gain scores and final examination
grades, (f) confidence testing and partial knowledge,
(g) characteristics of the material to be
learned, (h) composition scoring, and (i) statistical
methods.


IV.  LEARNING  STYLES  AND  MODERATOR 
VARIABLES 

Scope  of  the  Problem 

The sensitivity and predictive power of student
measurement and training evaluation techniques
can often be increased through the use of moderator
variables. This is because certain attributes of
select groups tend to make the testing evaluation
methods more or less appropriate for the groups.
Some of the factors which can be used as moderators
are (a) achievement level, (b) personal and
environmental variables, (c) social background
factors, (d) cognitive style, and (e) affective
reactions.

Cognitive styles are modes of thought, perception,
and memory; they are also information
processing habits. Some of the various types of
cognitive styles that have been identified are (a)
field dependence-independence, (b) attention span
(or span of awareness), (c) breadth of categorizing
(e.g., lumpers and splitters), (d) conceptual styles
(e.g., modes of categorization), (e) complexity
versus simplicity in word perception, (f)
reflective-impulsive, (g) leveling versus sharpening,
(h) susceptibility to cognitive interference, and (i)
ability to accept unrealistic experiences. French
(1963), using a factor analytic approach, delineated
two types of problem solvers: (a) those using
a systematizing approach and (b) those using a
scanning approach.


Rundquist  (1969)  contends  that  item  analysis, 
factor  analysis,  and  moderator  variables  have  not 
helped  to  increase  predictive  efficiency  because 
these  various  methods  fail  to  take  into  account  the 
fact  that  different  antecedents  can  produce  the 
same  behavior  across  individuals  (e.g.,  visual  recall 
via  eidetic  imagery  or  by  short  term  memory). 
According  to  Rundquist,  one  must  learn  the 
mediating  processes  used  by  individuals  in  learning 
to do a job and then construct tests for the antecedent
behaviors. These new tests would be better
measures  of  an  ability  than  more  global  tests,  and 
they  could  avoid  confounding  effects.  The  new 
test  or  measure  may  be  slanted  more  toward  one 
antecedent  than  another,  thus  increasing  the 
validity  coefficient. 

The  overall  trend  towards  individualization  has 
caused  some  writers  (Whitla,  1969)  to  plead  for 
more  research  on  student  types,  class  mix,  and  the 
disadvantaged.  Others  (Bligh,  1965)  have  called  for 
increased  differentiation  of  norms  for  different 
groups  (e.g.,  sex,  race,  locale).  Finally,  some  others 
(Project Impact, 1970) claim that computer-assisted
instruction and other forms of individualized
instruction are the best way to account for
broad  student  differences. 

On the debit side, Gagne (1968) disputes the
existence of learning styles. He thinks computer-assisted
instruction puts too much stress on the
machine rather than on the student. He does,
though, emphasize the need for individualized
instruction, and he acknowledges the idiosyncratic
nature of the student. Cohen (1970) feels that one
must be careful when using cognitive styles as
moderators and instructional aids, since they can
change over time. For example, much of Piaget’s
work has shown that the child’s problem solving
style and conceptual mode of thinking will
qualitatively change from infancy to adulthood.
Cohen concludes that a valid decision about an
individual’s cognitive style at one time may prove
to be invalid at another time.

One  final  note  concerns  the  special  case  of  the 
moderator  variable  approach  when  aptitudes  or 
aptitude  test  scores  interact.  When  this  occurs, 
differential treatment of groups is mandatory; if
not, erroneous or contaminated results will occur.


Motivation and Types of
Intelligence

There has been a plethora of recent research
emphasizing the effects of differential motivation
and differential thinking styles (erroneously
termed “intelligence”) on student achievement.
These concepts certainly should be held in mind
by  anyone  concerned  with  student  achievement, 
from  either  the  measurement  or  the  instructional 
point  of  view.  However,  the  payoff  of  the  studies 
in  these  areas  seems,  as  yet,  indeterminate  and 
problematical.  Many  of  the  studies  are  contradic¬ 
tory  in  results,  and  others  require  cross  validation 
before  their  indications  can  be  fully  exploited. 

Jensen (1969) postulates that there are two
types of intelligence, abstract and associative, and
that instruction and testing should be differentially
tailored to suit these different modes of
learning.

Rimland (1969) also suggests that there are two
types of intelligence, practical and abstract.
Rimland hypothesizes that practical intelligence is
needed for job performance, and that abstract
intelligence is needed for academic work. Such
thinking would imply that most trade schools
should rely heavily on job performance testing to
measure student achievement. Rimland says that
the traditional g, or general intelligence factor,
measures “intracerebral events,” or the ability to
abstractly manipulate symbols and events in the
head. This is the ability required of test takers.
Others are better at “extracerebral events,” or the
ability to sustain attention on and perform simple
tasks which simulate the job (e.g., perceptual
speed). Rimland posits that these two types of
intelligence are mutually exclusive. In his research,
he found that intelligence test scores correlated
much more highly with school grades than did
performance test scores, but that performance test scores
correlated much more highly with job performance than
did intelligence test scores. He concludes that
different types of training and separate types of
measurement are needed for students with different
types of intelligence.

Rotter (1966) conceives the effect of reinforcement
on behavior as dependent on whether the
person perceives a causal relationship between his
own behavior and the reward. If not, the result is
attributed to luck or to the control of others.
Internal control exists when the student thinks
reinforcement is contingent upon his own
behavior, while external control exists when the
student thinks reinforcement is controlled by
others or by chance events.

In one study investigating the internal-external
control thesis (Scott & Phelan, 1969), three groups
of hard core unemployables were tested with
Rotter’s Internal-External Control Scale. The
subjects in all three groups were matched on age,
socioeconomic status, and scholastic aptitudes.
The results showed that black and Mexican
American subjects demonstrated greater external
control than did white subjects. The authors
concluded that the externally controlled subjects
did not feel that there was a relationship between
individual effort and reward; therefore, they did
not work unless given external reinforcement (e.g.,
praise, money).

Atkinson (1966) presents a somewhat more
vigorous theory of motivation involving achievement
motivation, incentive, and goal expectancy.
Atkinson’s theory is depicted by the formula:

Motivation = f(motive x expectancy x incentive)

With the motive to approach a goal (nAch) held
constant at 1.00 and with expectancy and incentive
each equal to .5, the strength of motivation to
approach the goal is .25 (the highest possible). Atkinson
defines incentive as the goal attractiveness, and
motive as the ability to strive for satisfaction or to
accomplish. “The strength of motivation to
approach decreases as probability of success
increases from .50 to near certainty (ps = .90), and
it also decreases as ps decreases from .50 to
certainty of failure (ps = .10) (p. 17).”
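
The arithmetic behind the .25 maximum can be checked with a short sketch. It assumes, following Atkinson's risk-taking model, that incentive is the complement of expectancy (incentive = 1 - ps); that relation is not stated in this review, and the function name is purely illustrative.

```python
def approach_strength(motive, p_success):
    """Motive x expectancy x incentive for one goal."""
    # Assumption (not stated in the review): incentive = 1 - p_success.
    incentive = 1.0 - p_success
    return motive * p_success * incentive

# Tabulate the strength of motivation with nAch held constant at 1.00.
probabilities = [0.10, 0.30, 0.50, 0.70, 0.90]
strengths = [approach_strength(1.0, p) for p in probabilities]
for p, s in zip(probabilities, strengths):
    print(f"ps = {p:.2f}  strength = {s:.2f}")
```

Under this assumption the product is symmetric about ps = .50 and falls off toward both near certainty of success and near certainty of failure, which is the pattern described in the quotation.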

From this formulation, it is easily seen that the
young, deprived black child will rarely encounter a
probability of success of .5 or greater. Because he
perceives a certainty of failure, he then lacks the
motivation to approach a goal; therefore, he does
not perform as well in student measurement situations
as the non-deprived white child who
perceives a higher probability of success.

Katz (1967) more or less integrates the two
earlier theories into a coherent two-stage theory of
development which possesses implications for
student measurement. During the first stage (up to
two years of age) of development, the child’s
verbal efforts are normally reinforced by parental
approval. Selective approval, on the part of the
parents, can develop strong habits of striving for
proficiency in the child. During the second stage,
the parental standards and values of achievement
are internalized by the child. “The child’s own
implicit verbal responses acquire through repeated
association with the overt responses of the parents,
the same power to guide and reinforce the child’s
own achievement behaviors .... Internalization
doesn’t take place until strong externally
reinforced achieving habits have developed (p. 5).”
Lower class children (including most blacks) are
more dependent upon others for social reinforcement
in academic situations. Lacking internalization,
they will avoid achievement situations and
concentrate on other situations regarded as more
promising. “Lower class Negro children tend to be
externally oriented in situations that demand
performance. That is, they are likely to be highly
dependent on the immediate environment for the
setting of standards and the dispensing of rewards
(p. 8).”

Hess and Shipman (1965) present a very
interesting and alternative developmental
formulation. They feel that cognitive growth is
“. . . fostered in family control systems which
offer and permit a wide range of alternatives of
action and thought and that such growth is constricted
by systems of control which offer predetermined
solutions and few alternatives for
consideration and choice (p. 870).” In the
deprived family context, the parent-child control
system “. . . restricts the number and kind of
alternatives for action and thought that are opened
to the child; such constriction precludes a
tendency for the child to reflect, to consider and
choose among alternatives for speech and action.
It develops modes for dealing with stimuli and
with problems which are impulsive rather than
reflective, which deal with the immediate rather
than the future, and which are disconnected rather
than sequential (pp. 870-871).” Hess and Shipman
performed a research study using deprived (black)
and non-deprived mother and child pairs which
supported their hypotheses. These authors
concluded that the family shapes the modes of
communication in the child, which in turn shape
his thought and problem solving style.

In  summation,  these  four  positions  suggest  that, 
in  both  curriculum  development  and  student 
measurement,  differences  in  cognitive  style  and 
motivation  must  be  accounted  for  in  any  program 
which  purports  to  be  at  all  comprehensive. 

Race  and  Aptitude  as  Moderator 
Variables 

In a recent survey of 13 studies, Boehm (1971)
found that job knowledge and performance test
criteria always yielded the highest validities.
Generally, there are fewer validity differences
between racial groups when these more objective
criteria are used instead of ratings or rankings.

McFann (1969a, 1969b) noted that the differences
between high- and low-aptitude men in Basic
Combat Training were greatest on cognitive tasks,
and that the differences were not as marked on
motor skills and proficiency tests. In a Project
SPECTRUM study, high-, middle-, and low-aptitude
groups were selected, and individualized
training was given using videotape, one-to-one
student-teacher ratio, feedback, reinforcement,
and small increments. In some tasks, low-aptitude
men reached standard but took two to four times
longer; in other cases they did not master the
material at all. McFann also found that high-aptitude
groups learned equally well with lecture
or individualized training, while low-aptitude
groups learned well with individualized training,
but not with lecture.

Foley (1971) wanted to determine if the
Officer Qualification Test (OQT) was biased
against blacks in determining final Officer Candidate
School (OCS) grade point averages. The final
OCS grades of blacks from Caucasian colleges were
not significantly different from a matched white
sample. Blacks from Negro colleges, though, did
receive significantly different grades than their
matched white subjects (p < .005). In general, the
OQT predicted better for the white sample, even
though it was significant for both races.

Guinn, Tupes, and Alley (1970a, 1970b)
wished to determine if the prediction of training
success varied across subgroups. If this is the case,
then overall predictive efficiency suffers. These
writers found differences in training performance
across race, area of the country, and education.
The three differences, though, did not all appear in
every occupational specialty. It can be inferred from
these results that factors such as race and variations
in cultural opportunity, as may exist across
different educational and regional groups, can
account for the differences in test scores across
groups.
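
The idea that a pooled validity coefficient can conceal subgroup differences can be illustrated with a minimal sketch. The records, group labels, and function names below are hypothetical and are not data from any study cited here; the sketch simply computes the predictor-criterion correlation within each subgroup and for the pooled sample.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def subgroup_validities(records):
    """records: (subgroup, predictor score, criterion score) triples."""
    groups = {}
    for group, predictor, criterion in records:
        xs, ys = groups.setdefault(group, ([], []))
        xs.append(predictor)
        ys.append(criterion)
    return {g: pearson(xs, ys) for g, (xs, ys) in groups.items()}

# Hypothetical records: the test predicts well in subgroup A, poorly in B.
records = [("A", x, 2 * x) for x in [1, 2, 3, 4, 5]]
records += [("B", x, y) for x, y in zip([1, 2, 3, 4, 5], [3, 1, 4, 2, 3])]

validities = subgroup_validities(records)
pooled = pearson([r[1] for r in records], [r[2] for r in records])
print(validities, pooled)
```

When the subgroup coefficients differ markedly, as in this contrived case, a single regression equation for the whole sample predicts less efficiently than separate treatment of the subgroups.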

In a study performed at the American Telephone
and Telegraph Company (Grant & Bray,
1970), task proficiency after training was used as a
criterion because the investigators thought that it
was uninfluenced by supervisory bias, peer pressure
to control output, and motivation.

Five hundred subjects, both blacks and whites,
some of whom met and some of whom failed to meet
normal selection standards, were involved. Seven hierarchical levels
of training were employed using tasks regularly
performed by craftsmen. Pretest and posttest tasks
were given at each level, and the highest level completed
was the criterion. The results demonstrated
that all selection instruments correlated with
highest level passed, and there were no differences
in minority and non-minority correlations. The
School and College Abilities Test plus a test of
abstract reasoning yielded a multiple R of .49
when correlated with the training criterion.

Age and Sex as Moderators

Using  the  Gates  Reading  Readiness  Test  and  the 
Metropolitan  Achievement  Test  for  elementary 
school  students,  Miller  and  Norris  (1967)  found 
that younger school entrants were at a disadvantage
at the start. This effect, though, disappeared
after the first grade. The late entering group
tended to have more achievement and psychological
referral problems than the early and normal
entrant  group. 

Gay (1969) investigated the differential effectiveness
for males and females of three computer-assisted
instruction (CAI) treatments on delayed
retention of mathematical concepts. The three
methods of presentation were (a) “variable
example,” which depends on the subject’s preinstruction
retention index as measured by the
Gay Retention Index; (b) “choice,” which allows
the subject to decide on how many examples he
needs; and (c) “fixed,” which allows the subject
three trials per mathematical concept. Fifty-three
eighth grade subjects (27 male and 26 female)
were randomly assigned to the treatments. The
results indicated that (a) the females in the variable
example group performed better than the
females in the fixed and choice example groups
(p < .05); (b) males in the choice group performed
significantly better than females in the choice
group (p < .05); (c) males in the choice group
performed significantly better than males in the
variable example and fixed groups (p < .05); and
(d) females in the variable example group
performed better than males in the variable
example and fixed groups. Gay concluded that the
choice method is best for males. Even though the
males averaged three choices, they gave more trials
to the difficult items and fewer trials to the easier
items. The Gay Retention Index, though, seemed
to be good for selecting the number of items for
females.

Cross-National  Evaluation 

Husen (1969) discusses cross-national evaluation
and points out that such evaluations can be
confounded because objectives differ across national
boundaries, as do traditions, emphasis, age levels of
introduction, and opportunity. Husen also points out
that the real purpose of cross-national evaluation is
“. . . not to make overall comparisons between
countries - we are not engaged in an international
contest - but to obtain meaningful comprehensive
measures of both cognitive and non-cognitive outcomes
and to relate these to a comprehensive set
of input variables, including those which measure
opportunity. Thereby, provisions are made for a
fruitful multivariate analysis of how outcomes are
related to inputs (p. 343).”

Summary 

This chapter was concerned with the various
effects of learning styles and moderator variables.
First, moderator variables were defined and
discussed. Following this was a presentation of
several motivational and developmental theories
which purport to lend some insight into how
moderator effects materialize. Additional sections
of the chapter contained studies of race and aptitude
levels as moderator variables; age and sex as
moderators; and problems of cross-national evaluation.
It was noted that although the moderator
variable approach appears to possess merit,
moderators are often elusive. Their identification
and their desirability may be dependent on a host
of interactive effects. Thus, although no advanced
program will ignore moderators, one should not
anticipate that they will provide a pat solution to
prediction problems.


V.  CURRENT TRENDS 

Trends 

About ten years ago, Schultz and Siegel
(1961a) perceived a trend in evaluation research
which has since been demonstrated. They found
that rather than investigating an overall performance
criterion, it is better to use factor analysis or
multidimensional scaling techniques to identify
the important components of the job or training
task. In the past, there has been too heavy a reliance
placed on the single composite criterion. This
practice is wasteful and hides useful information.
More and more recent research has demonstrated
that one score cannot possibly represent the multidimensional
and orthogonal aspects of performance.
Once the investigator arrives at multiple
criteria, he can use a weighted sum of the
subcriteria to arrive at a composite evaluation.
Schultz and Siegel also stressed in the validation of
training programs the need to determine if
performance changes over time. If so, one might
wish to sample performance at different times or
determine if a longer time span is needed.
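
The weighted-sum step can be sketched as follows. The subcriterion names, scores, and weights are hypothetical, and standardizing each subcriterion before weighting is an assumption made here so that the weights are not distorted by differences in scale; the review does not specify a particular procedure.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def composite(subcriteria, weights):
    """Weighted sum of standardized subcriterion scores, one per trainee."""
    z = {name: standardize(vals) for name, vals in subcriteria.items()}
    n = len(next(iter(z.values())))
    return [sum(weights[name] * z[name][i] for name in z) for i in range(n)]

# Hypothetical scores for four trainees on three subcriteria.
subcriteria = {
    "troubleshooting": [62, 75, 80, 91],
    "procedural_accuracy": [88, 70, 79, 85],
    "speed": [40, 55, 52, 61],
}
# Hypothetical judgment weights; they need not sum to 1.
weights = {"troubleshooting": 0.5, "procedural_accuracy": 0.3, "speed": 0.2}

scores = composite(subcriteria, weights)
print([round(s, 2) for s in scores])
```

Keeping the subcriterion scores alongside the composite preserves the diagnostic information that a single overall score would hide.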

Merrifield (1965) agrees with Schultz and Siegel
(1961a) about the need for more multivariate
training evaluative studies. He places special
emphasis in this regard on the special abilities
student.

A second trend has been noted in terms of
emphasis on cross-cultural training. Brislin (1970)
presents a rather acid critique of most military
cross-cultural training programs. The aim of
cross-cultural programs, according to Brislin, is to allow
the military to function behaviorally and effectively
in a foreign environment. Most programs,
though, do not have data on effectiveness, and the
evaluative methods used are inadequate. When
evaluations were conducted, they were too
dependent on verbal and written reports of the
trainees. More data need to be collected on the
actual overseas behavior of trainees; therefore,
responses to attitudinal questionnaires need to be
verified by other means. Evaluation needs to be
conducted by researchers not associated with the
program. Also, the attitudes of foreign nationals
should be sampled. Techniques should be available
to assess transfer of training to the actual foreign
situation with more replication and followup
training.

Fiedler, Mitchell, and Triandis (1970) and
Worchel and Mitchell (1970) have recently described
an exciting new technique known as the
Cultural Assimilator, which is based upon the
critical incident technique. In this technique,
critical incidents are obtained in which the norms
or behaviors across cultures are quite different.
Questions are asked about the incident with
multiple-choice answers and immediate feedback.
A target sample from the host culture selects the
correct multiple-choice responses.
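
The item format described above (incident, question, multiple-choice answers, immediate feedback, and a key set by a host-culture sample) can be represented with a simple data structure. This is only an illustrative sketch; the class name, field names, and the sample incident are hypothetical and are not taken from the published Cultural Assimilator materials.

```python
from dataclasses import dataclass

@dataclass
class AssimilatorItem:
    incident: str    # critical incident in which cultural norms differ
    question: str
    options: list    # multiple-choice answers
    feedback: list   # one explanation per option, shown immediately
    keyed: int       # index of the option chosen by the host-culture sample

    def respond(self, choice):
        """Return (is_correct, feedback) as soon as the trainee answers."""
        return choice == self.keyed, self.feedback[choice]

# Hypothetical item for illustration only.
item = AssimilatorItem(
    incident="A visitor declines food offered by a host.",
    question="How is the host most likely to interpret this?",
    options=["As politeness", "As a rejection of hospitality"],
    feedback=[
        "Reread the incident; the host-culture sample chose otherwise.",
        "Correct; this matches the host-culture sample's choice.",
    ],
    keyed=1,
)
print(item.respond(0))
print(item.respond(1))
```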

An experiment recently performed by the Navy
compared two- and six-week Vietnamese language
courses. The results demonstrated that (a) graduates
of either course met most objectives in that
they were able to acquire some vocabulary and
conversational skills; (b) students of higher aptitude
performed extremely well in the six-week
course; (c) the language laboratory produced problems
which were later rectified; (d) many graduates
thought the course was inefficient and that
they did not use all that they were taught; and (e)
low-aptitude students were only marginally
adequate.

Predictive  Evaluation 

Richards, Holland, and Lutz (1967) found that
non-academic accomplishment was relatively independent
of academic achievement in college.
Non-academic accomplishment in high school
correlated .39 with non-academic accomplishment
in college. On the other hand, the American
College Testing Program’s College Admissions Test
correlated .29 with college grades, and high school
grades correlated .38 with grades in college. The
authors concluded that this study is important for
college admissions officers who are interested in
the non-academic as well as the academic potential
of the students they accept.

Ryan (1968) compared students taking a
conventional 12th grade mathematics course with
students taking an experimental mathematics
course to determine if prior courses in high school
can moderate performance in college courses. The
students were also given a mathematics achievement
test, a mathematics proficiency test, and a
verbal ability test. The results showed that the
mathematics achievement test correlated more
highly with grades than did the mathematics
proficiency test for the experimental group and
vice versa for the conventional group. Also,
students in the experimental group performed
significantly better than conventional students on
mathematics achievement, but no better on
mathematics proficiency or verbal ability. Hence,
the achievement test probably reflects differences
in prior instruction rather than differences in more
general abilities.

Goolsby, Frary, and Lasco (1968) compared
the results of the Florida Bar Examination with
grades and aptitude test scores to determine if
these latter measures could be used instead of part
or all of the lengthy and expensive Bar
examination. Only low correlations were found,
causing the authors to conclude that no aptitude
test scores or grades could supplant the Bar examination.
In another law predictive context (Klein &
Evans, 1968), nine experimental measures were
correlated with law school success for 978 law
students across several schools. Undergraduate
grade point average turned out to be the best
predictor of law school grade point average in
some schools, while the Law School Admissions
Test was the best predictor in other schools. The
authors concluded that undergraduate achievement
can predict graduate achievement for law
school students. In another law school situation
(Lunneborg & Lunneborg, 1967), 557 law school
students were surveyed in order to ascertain which
types of undergraduate courses predict law school
success. Verbal, accounting, and language courses
were found to be the poorest predictors, while
philosophy, economics, history, and business
administration were the best.

Kaplan, Freedman, and Kaplan (1968) wished
to examine the utility of replacing clinical ratings
of psychiatry students with the National Board of
Medical Examiners Test. This latter test was found
to correlate .44 with the ratings. These writers,
though, indicate that other types of information,
in addition to the test score, are needed because
the written examination does not account for
enough of the variance of the dimensions being
investigated by the ratings. The dimensions of
personality and psychopathology are not assessed
by the test, but they are assessed by the ratings.
Some further investigation of the ratings seems
warranted, though, since they are so much more
subject to bias and error than tests.

Bergstrom (1968) related measures of school
achievement to important job behaviors in order
to evaluate a school curriculum. A sample of
students (N = 150) was taken from three types of
schools: (a) urban vocational, (b) urban comprehensive,
and (c) suburban comprehensive. The
results indicated that vocational training should
stress personal adequacy and communication
skills. The results of this study showed that (a)
those employees with specific vocational training
were more likely to be placed on a related job; (b)
students with low grades (D) in vocational courses
obtained lower job evaluation only in skill areas of
the job; (c) graduates who were poor in school
attendance tended to get significantly lower
ratings; and (d) one-half of all trained workers
were not placed or retained in a job they were
trained for.

Bale, Rickus, and Ambler (1970) wished to
determine if undergraduate aviation training could
be used as a predictor of graduate or replacement
air group (RAG) instruction. The traditional
criterion for student aviators has been successful
completion of undergraduate flight training, but
this was felt inadequate because it did not account
for RAG instruction. The grades in training were
based on (a) air to air weapons, (b) air to ground
weapons, (c) basic ground, and (d) instrument
navigation. The multiple regression coefficient
between training grades and success-failure in RAG
was .43; in a cross-validation sample it was .36.
Use of these prediction measures would have
reduced attrition in RAG by 34 percent. The
investigators also found that 15 tests gave a
multiple R of .43, while four tests gave a multiple
R of .38.
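
The drop from a multiple R obtained in the derivation sample to a lower value on cross-validation can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the data are simulated, the criterion is continuous rather than the success-failure criterion used in the study, and the helper names are hypothetical. Weights are fitted by ordinary least squares in one sample and then applied unchanged to a second sample.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit(xs, y):
    """Least-squares weights (intercept first) from the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, xs):
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], x)) for x in xs]

def multiple_r(weights, xs, y):
    """Correlation between predicted and observed criterion scores."""
    return pearson(predict(weights, xs), y)

def simulate(n, rng):
    """Four simulated training grades; only two relate to the criterion."""
    xs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
    y = [0.3 * x[0] + 0.2 * x[1] + rng.gauss(0, 1) for x in xs]
    return xs, y

rng = random.Random(0)
xs_a, y_a = simulate(200, rng)   # derivation sample
xs_b, y_b = simulate(200, rng)   # cross-validation sample
w = fit(xs_a, y_a)
r_derivation = multiple_r(w, xs_a, y_a)
r_cross = multiple_r(w, xs_b, y_b)
print(round(r_derivation, 2), round(r_cross, 2))
```

Because the weights capitalize on chance relationships in the derivation sample, the cross-validated coefficient is the more realistic estimate of predictive efficiency, and adding predictors tends to widen the gap between the two values.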

A final study demonstrates that OCS grades can
be used to predict officer effectiveness (Rhea,
1965). The fitness reports of 2,183 OCS graduates
were obtained after 18 months of service. A low,
but significant, correlation between each OCS variable
and fitness was obtained (average r = .22). In
general, fleet fitness reports were less predictable
than shore fitness reports. The best predictors
were final school grades and military aptitude,
which had correlations ranging from .16 to .37.

Sensitivity  Training 

Another comparatively recent innovation involves
sensitivity training and its associated
methods including T-groups, role playing, and the
like. Bass, Thiagarajan, and Ryterband (1968) are
severely critical of sensitivity, or T-group, training.
They say that “. . . we still may hear complaints
about the lack of evaluation of sensitivity training,
yet a bibliography of at least 50 evaluative studies
now exists. . . . why have these studies failed to
impress social scientists? . . . A major reason may
be  because  insufficient  attention  has  been  devoted 
to  the  purposes  of  the  evaluation  and  the  public 
for  whom  the  evaluation  is  being  prepared”  (p. 
20- 

One very controversial study by Golembiewski
and Carrigan (1970) involved an assessment of
change resulting from sensitivity training. The
sample in this study was 16 commercial sales
managers. Progress was measured by self-report on
the 48 items of Likert’s (1967) Profile of Organizational
Characteristics. The participants rated
their organization twice, once as their conception
of the ideal, and once as they perceived it to be in
actuality. This was done both early in the week of
training and four months after training. Both
“ideal” and “now” scores increased in the interim
in the “participative” direction, thus supporting
the authors’ hypothesis. The authors themselves
acknowledge the possibility of the Hawthorne
effect or other methodological weaknesses in their
design, but tend to minimize such possibility in
favor of true change. Becker (1970), though,
seems to think the study is of little value for
several reasons: Golembiewski and Carrigan failed
to rule out alternative explanations; they indicated
that the Hawthorne effect cannot be rejected, yet
they rejected it; and they failed to account for
changes which could have occurred through
passage of time. Becker closes with “. . . changes
did and probably continued to occur, so it may be
permissible to sell such a design to managements;
but under no circumstances should one attempt to
sell such a design as science (p. 96).”

In another study (Cook, Hahn, & Sheppard,
1971), 23 Navy Medical Service officers took part
in a three and one-half day management style
seminar; a six-month intervening period at a duty
station followed, and then a two and one-half day
management style session was conducted. In their
training sessions, the officers were presented with
(a) problem analysis using the “force field method”;
(b) group ranking which allowed for cross-subject
influencing; and (c) small group management style
sessions. In the six-month intervening period, the
subjects were urged to use their newly acquired
techniques. The final session included discussion,
reinforcement, and feedback of management style
data. The Management Value Index (MVI), an
index of management style, was given at the
beginning and end of the first session, and at the
end of the second session. The results indicated
course influence. The Leadership Opinion
Questionnaire was also administered, and the
results indicated a decrease in structure without a
corresponding decrease in consideration. These
results are somewhat suspect, since participants
thought their management styles were more open
than did their colleagues and subordinates,
especially with regard to participation. The
authors concluded that the much larger value
change between the second and third administration
of the MVI suggests the need for an on-the-job
“incubation period” in order for attitudes to
change.

Federman and Siegel (1965), in a group dynamics
study, isolated four performance-related
communication factors from training teams in a
helicopter simulator. These four factors were
derived from a factor analysis of 14 communication
predictors shown to be related to miss
distance in antisubmarine warfare. The four
factors were (a) probabilistic structure, (b) evaluative
interchange, (c) hypothesis formulation, and
(d) leadership control. In a second study, Siegel
and Federman (1969) cross-validated the factors
and developed a training course based on the
derived factors. The trained group was found to
perform better than a control (untrained) group in
two performance tests involving enemy submarine
detection and destruction.

Programmed  Instruction 

Lumsdaine (1970) feels that the most important
contribution of programmed instruction is
not improvement in instruction, but rather the
implicit requirement for clearly stated objectives
in behavioral terms.

Mager (1970a, 1970b) maintained that it is
impossible for the instructor to apply all the
principles of learning in the classroom. This is not
because he does not want to, but because the
learning environment is prohibitive. “We still put
large groups of students in front of a single instructor
and insist that they all learn at the same rate
(p. 4).” This procedure may be convenient and
inexpensive, but it is inefficient. Programmed
learning devices and machines are held to possess
the potential for solving these problems since they
usually (a) present instruction in small steps; (b)
reinforce the student along the way; (c) help the
student proceed at his own pace; and (d) feed back
responses into the device to modify instruction to
fit the particular needs of the student.

In  sequential  programming,  learning  proceeds  in 
very  small  steps,  and  all  learners  go  through  the 
same  steps.  In  alternate  programming,  though,  the 
student’s  steps  can  be  different,  and  they  are 
governed  by  the  student’s  own  responses. 
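
The contrast between the two kinds of programming can be sketched in
code. The following Python fragment is only an illustration and is not
taken from any of the studies reviewed here; the Frame structure, its
next_if_correct and next_if_wrong fields, and both function names are
hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Frame:
    """One small instructional step (hypothetical structure)."""
    name: str
    next_if_correct: Optional[str] = None  # frame to show after a right answer
    next_if_wrong: Optional[str] = None    # frame to show after a wrong answer


def sequential_path(frames: List[Frame]) -> List[str]:
    """Sequential programming: every learner sees every frame, in order."""
    return [frame.name for frame in frames]


def alternate_path(frames: Dict[str, Frame], start: str,
                   responses: Dict[str, bool]) -> List[str]:
    """Alternate programming: the learner's own responses pick the next frame.

    `responses` maps a frame name to whether the learner answered it
    correctly; frames without an entry are treated as answered correctly.
    """
    path: List[str] = []
    current: Optional[str] = start
    # Stop at the end of the program, or if a frame would repeat
    # (a simplification that keeps this sketch finite).
    while current is not None and current not in path:
        path.append(current)
        frame = frames[current]
        correct = responses.get(current, True)
        current = frame.next_if_correct if correct else frame.next_if_wrong
    return path
```

In this sketch a learner who misses a frame is routed through a review
frame before rejoining the main line, whereas the sequential version
returns the same list of frames for every learner.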

Keller (1968) indicated that the techniques of
programmed instruction can be used in any classroom
situation. However, according to Keller, one
criterion  that  the  instruction  must  meet  is  that  it 
be  individualized.  Another  requirement  is  that 
criterion-referenced  testing  be  used. 

Lindvall and Cox (1969) present a Structured
Curriculum Model (SCM) for developing a programmed
instructional course. They state that one
must define specific objectives and organize them
according to difficulty or prerequisites. This organization
provides a structural sequence which is a
frame for determining the student’s present status
and for his future planning. In the SCM, the
curriculum materials must be matched to the
objectives, and one must keep in mind that
students can master the same objectives with
different kinds of material. In addition, the
student must be given a diagnostic evaluation to
place him in the proper location along the learning
continuum. The placement test should “. . .
select items which test representative objectives
along the continuum (p. 170).” Pretests are also
suggested prior to each instructional unit, because
the student may be able to cope with some of the
objectives in the unit, and not others. Evaluation
in this model is by way of “curriculum embedded
tests” and “post-unit” tests. Curriculum embedded
tests (a) measure one objective of a unit; (b)
are content-referenced; (c) are short; and (d)
enable the teachers to make decisions regarding
student advancement. Post-unit tests help the
teacher to decide whether the pupil should
progress to the next unit or should be given
remedial work.
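
The testing sequence in this model (placement, unit pretest, post-unit
decision) can be sketched as follows. This is a minimal illustration
under stated assumptions, not Lindvall and Cox's procedure: the function
names, the mastery dictionaries, and the numeric cutoff for the post-unit
test are all hypothetical.

```python
from typing import Dict, List


def place_student(ordered_objectives: List[str],
                  mastered: Dict[str, bool]) -> int:
    """Placement: index of the first objective on the continuum that the
    student has not yet mastered (len(...) means all were mastered)."""
    for index, objective in enumerate(ordered_objectives):
        if not mastered.get(objective, False):
            return index
    return len(ordered_objectives)


def objectives_to_teach(unit_objectives: List[str],
                        pretest: Dict[str, bool]) -> List[str]:
    """Unit pretest: keep only the objectives the student could not
    already handle, since he may cope with some and not others."""
    return [obj for obj in unit_objectives if not pretest.get(obj, False)]


def post_unit_decision(score: float, cutoff: float) -> str:
    """Post-unit test: progress to the next unit or assign remedial work.
    The cutoff is an assumed parameter, not a value from the source."""
    return "advance" if score >= cutoff else "remedial"
```

Objectives are assumed to be listed in prerequisite order, so that the
placement index marks a single point along the learning continuum.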

Glaser (1967) insists that uniformity within any
one grade level can never be achieved because of
individual differences. This results in the need for
programmed or computer oriented instruction.
Glaser also suggests that too much research has
been done comparing methods and not enough
on learning which variables affect students and
how. Glaser describes the
requirements for individualized instruction that
have been set forth at the Learning Research and
Development Center:

1. Time limits and grade levels must be
redesigned so the student works at his actual
achievement level; he progresses only
after he has mastered the prerequisites for
the next higher level.

2.  Sequences  of  progression  must  be  assigned 
to  each  student. 

3.  Progress  must  be  continually  assessed  to 
modify  the  teaching  program  to  fit  pupil 
needs. 

4.  Materials  should  be  provided  to  the  student 
which  will  self-direct  his  learning. 

5.  Performance  standards  (feedback)  should  be 
provided  to  the  student. 

6.  A  data  processing  system  should  be  provided 
so  that  the  teacher  can  take  advantage  of 
detailed  information  about  each  student, 
and  construct  an  appropriate  program  for 
him. 

7.  Pretests  and  posttests  should  be  provided  for 
each  instructional  unit. 

8.  Sequential  testing  procedures  should  be 
employed  for  initial  placement. 

Whitmore (1970c, pp. 33-34) recites four learning
principles that are contained in automated
individualized instruction that are not generally
found in traditional instruction. These learning
principles are (a) continuous participation by the
student in the instructional process; (b) providing
immediate knowledge of the results to the student
for each response that he makes; (c) recognition of
individual differences in rate of learning; and (d)
providing a high rate of success for the student
throughout learning.

The last principle, Whitmore says, is the most
difficult to implement, since it requires very
careful analysis of the material to be learned.

McFann  (1969a,  1969b)  characterizes  training 
strategies  and  their  characteristics  as  follows: 


Strategy   Curriculum   Time       Standard

   1       Fixed        Fixed      Variable
   2       Fixed        Variable   Fixed or variable
   3       Variable     Fixed      Variable
   4       Variable     Variable   Fixed or variable

In  this  scheme,  a  fixed  standard  means  that  the 
student  is  to  reach  a  minimal  level,  while  a  variable 
standard  means  that  the  student  can  go  beyond 
the  minimal  level  to  another  higher  level. 

Strategy  1  is  only  recommended  when  the 
input  to  the  course  is  homogeneous;  if  it  is  not, 
there  will  be  variable  output.  It  ignores  individual 
differences  and  involves  the  additional  problem  of 
where  to  set  the  level  of  training.  Strategy  2  is 
similar  to  most  present  training  in  the  military. 
Those  who  fail  to  pass  the  first  time  are  recycled 
(variable  output  time).  One  can  gear  the  training 
to  low-aptitude  men,  or  allow  the  more  intelligent 
men  to  go  through  the  program  faster.  Strategy  3 
has  a  fixed  time  limit  and  will  result  in  variable 
output.  Strategy  4  is  the  most  flexible  and  the 
most  individualized,  but  it  requires  the  best 
management. 

Computer  Assisted  Instruction  (CAI)  and  Testing 

Computer assisted instruction represents one of
the most recent innovations in training methodology.
One of the main problems of CAI is its cost
when compared with other similar methods which
might give equivalent results (e.g., TV). Another,
more serious, objection to CAI is that it does not
allow the student enough opportunity or freedom
to chart his own progress (Hammel, 1969).

Hansen,  Hedl,  and  O’Neal  (1971)  feel  that 
computer  assisted  testing  will  come  into  full 
flower  this  next  decade.  One  reason  given  for  this 
is  the  evidence  that  people  answer  questionnaires 
more  honestly  when  they  are  presented  via 
computer  than  by  traditional  methods. 

Holtzman (1971) says, “In a traditional setting,
the instructor keeps a record of how well each
student does on each achievement test for the
course, while the periodically collected scores
from standardized normative tests are stored
centrally. When instruction is individualized, testing
must be done more frequently and at different
times for each student (pp. 547-548).”

Seidel (1969) discusses the purpose of Project
IMPACT, which is to provide the Army with an
appropriate and efficient CAI system adaptable to
the individual trainee. Programs are to be branched
and adapted to the entry characteristics of the
trainee and his performance throughout instruction.
Some of the important decision factors
involved are (a) entry characteristics, (b) education
and background, (c) responses of trainee, (d)
response latency, (e) pattern and history of errors,
(f) relation of individual and group norms to
responses, and (g) subject matter.

Gagne (1968) disagrees with most of these
writers regarding the usefulness of computers in
testing (and instruction). He thinks that CAI puts
too much stress on the machine rather than on the
student.

Atkinson (1967) discusses three levels of CAI:

1. Simple - “fixed, linear sequence of problems
(p. 56).” There is no method of changing the
instruction as a consequence of the student’s
responses. These are also called “drill and
practice” systems.

2. Complex - also called “dialogue” systems.
They provide high-level interaction between
student and system. The students can give
many variations of response, can ask a
variety of questions, and can generally
control the sequence of learning.

3. Tutorial - these fall between simple and complex
with regard to the student’s interaction with
the system. There can be decision making or
branching, depending upon the student’s
responses. The students can, therefore,
follow separate paths. One of Atkinson’s
findings was that fast learners, on a month
by month basis, showed a continual
improvement in rate of progress, while
medium and slow students had constant
rates of improvement.

Ferguson (1970) described how computer
assisted criterion-referenced measurement was
applied to an experimental school in individually
prescribed instruction (IPI). Addition and subtraction
skills were taught in a sequence in which each
stage built onto and was required for the next
stage. After each answer, the computer made a
decision, on the basis of percentage correct and
number of problems of this type attempted,
whether to go to the next level or continue
presenting problems of the same type. Each item
was randomly selected from a population of
similar items. Direct manipulation of Type I or
Type II errors was possible. The Type I error allows
the student to progress to the next level prior to
mastery; therefore, this is considered the more
serious type of error.
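
A decision rule of this general kind can be sketched with a sequential
probability ratio test. The review does not give Ferguson's exact rule,
so the use of Wald's test here is an assumption, as are the function
name, the two proportions p_nonmaster and p_master, and the default
error rates. In keeping with the text, alpha is the chance of a Type I
error (advancing a student who has not mastered the skill) and beta is
the chance of a Type II error.

```python
import math
from typing import Iterable


def mastery_decision(answers: Iterable[bool],
                     p_nonmaster: float = 0.6,
                     p_master: float = 0.9,
                     alpha: float = 0.05,
                     beta: float = 0.10) -> str:
    """Return "advance", "non-mastery", or "continue" after a run of answers.

    The null hypothesis is that the student is a non-master (proportion
    correct = p_nonmaster); the alternative is mastery (p_master).
    All four parameters are illustrative assumptions.
    """
    upper = math.log((1 - beta) / alpha)   # cross this: decide mastery
    lower = math.log(beta / (1 - alpha))   # cross this: decide non-mastery
    log_ratio = 0.0
    for correct in answers:
        if correct:
            log_ratio += math.log(p_master / p_nonmaster)
        else:
            log_ratio += math.log((1 - p_master) / (1 - p_nonmaster))
        if log_ratio >= upper:
            return "advance"
        if log_ratio <= lower:
            return "non-mastery"
    # Neither boundary reached: present another problem of the same type.
    return "continue"
```

Lowering alpha raises the upper boundary, so more correct answers are
required before the student is advanced; this is one way in which the
two error rates can be manipulated directly.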

Applications  of  Programmed  Instruction 

Yeager and Kissel (1969) hypothesized that the
number of days needed to master a unit of instruction
is related to the students’ “initial entering
state.” The entering state variables were (a) unit
pretest score which, when subtracted from 100,
gives the distance or amount to be learned; (b)
number of types of pretest skills on which the
student failed to show mastery (IPI only concentrates
on these); (c) intelligence; and (d) age, which
reflects student maturity. The entering state
variables used in this study, therefore, were pretest
scores, number of skills to be mastered, I.Q., age,
and total units mastered previously. The results
demonstrated that pretest score, number of skills
to be mastered, and age were the best predictors,
while I.Q. score had the least influence. The
multiple correlation coefficients for different
types of materials ranged from .65 to .84 (N = 40).
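
The entering state variables can be held in a simple record, with the
amount to be learned computed as 100 minus the pretest score. The sketch
below is illustrative only: the class and field names are hypothetical,
and the function computes an ordinary two-variable correlation, which is
a simplification of the multiple correlation reported in the study.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EnteringState:
    """Entering state of one student for one unit (hypothetical record)."""
    pretest_score: float     # unit pretest score, 0 to 100
    skills_unmastered: int   # pretest skills on which mastery was not shown
    iq: float
    age: float
    units_mastered: int      # total units mastered previously

    @property
    def amount_to_learn(self) -> float:
        """Distance to be learned: pretest score subtracted from 100."""
        return 100 - self.pretest_score


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Correlation between one predictor and days needed for mastery."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)
```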

Atkinson (1967) found that students in an
experimental CAI reading program performed
significantly better in all aspects of reading (e.g.,
pronunciation, vocabulary, recognition) than did
students in conventional (control) reading classes.
The control group received CAI mathematics
instruction, but not CAI reading instruction.

K. Johnson (1968) examined the results of
three different methods of teaching military
communications courses. The three methods used were
conventional, programmed instructional booklets,
and partially individualized (first week
conventional followed by self-paced). The results
showed that the self-paced (partially individualized)
instruction produced a 16 percent
reduction in course length, while the programmed
instruction produced a 9 percent decrease in
course length. These reductions were accomplished
without loss of skill.

Geisert (1970) wished to examine the contribution
of format and feedback to learning. Two
groups of Army National Guardsmen (N=44) were
used as subjects. All concepts to be learned in the
experimental group were arranged hierarchically
(mapped) to ease positive transfer to the next
highest level. Fifteen dependent variables were
used including reading time on booklet, test
scores, time spent reading instructions, time spent
on practice, and time spent on problem solving
instructions. The results demonstrated no significant
differences between the hierarchical group
and the traditional group, except that the former
group tended to do all things slightly faster.
Similar results were obtained for the feedback and
no-feedback groups. With regard to certain attitude
scales which were administered, it was shown that
subjects preferred to learn from the mapped-feedback
system over the traditional system. The
subjects also thought that a computer assisted
screen was an effective way to present material
when compared to booklet material, although
neither was shown to be more or less effective
than the other.

A novel and interesting approach to self-paced
instruction was recently developed by Sheppard
and MacDermot (1970). Subjects were 203
students enrolled in an experimental course and 98
students enrolled in a traditional course. The
students in the experimental group were to study
one of 36 sections of a psychology book. After
study, the students were asked to explain the
lesson in detail to another student who had
already completed the work, or to an instructor. If
the learner failed, he would repeat the lesson until
mastery was achieved. Completion of all 36 interviews
earned a grade of A, 75 percent a grade of B,
50 percent a C, and 33 percent a D. The control
group was as comparable as possible, since the
students spoke in small groups and used the same
book. At course completion, both groups were
given 100 multiple-choice questions and five essay
questions. The control group was told that the
final examination contributed 50 percent of their
grade, while the experimental group was told that
the final examination did not count. In addition,
the control group was informed that they had to
finish the entire test. These last two factors should
produce a bias in favor of the control group. The
mean for the experimental group on the multiple-choice
test was 73.1, and for the control group it
was 66.8 (p < .01). On the essay questions, the
experimental group scored 17.4, and the control
group 13.9 (p < .01). Also, composite student
satisfaction, as measured by an attitude scale, was
higher for the experimental group (p < .01). Of
those queried, 94 percent thought the interview
method was more effective than the lecture
method.
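
The grading scheme of the experimental course can be written as a short
function. This is a sketch under two assumptions that the source does
not state: that the percentages are minimum thresholds, and that fewer
than 33 percent of the interviews earns a failing grade of F. The
function name is hypothetical.

```python
def interview_grade(completed: int, total: int = 36) -> str:
    """Letter grade from the number of interviews passed.

    Follows the scheme described in the text: all interviews for an A,
    75 percent for a B, 50 percent for a C, and 33 percent for a D.
    Treating these as minimums, and returning "F" below 33 percent,
    are assumptions of this sketch.
    """
    if not 0 <= completed <= total:
        raise ValueError("completed must lie between 0 and total")
    if completed == total:
        return "A"
    percent = 100 * completed / total
    if percent >= 75:
        return "B"
    if percent >= 50:
        return "C"
    if percent >= 33:
        return "D"
    return "F"
```

Under these assumptions, 27 of the 36 interviews is the minimum for a B,
18 for a C, and 12 for a D.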

Siegel and Fischl (1965) were concerned with
pre-emergency training which prepares the public
for a disaster or critical situation. They employed
a technique known as “adjunct auto-instruction,”
which is meant to supplement other training
techniques or points that need emphasis and stress.
Adjunct auto-instruction tends to keep the learner
active, and gives him feedback. The subjects were
four matched groups (N = 9 to 13 per group) of
semi-skilled, adult, employed women receiving
attack survival material. The four experimental
conditions provided that the subjects (a) receive
material by phone, (b) read material in print, (c)
read material in print and receive adjunct
auto-instruction, or (d) receive material by
telephone and receive adjunct auto-instruction. The
non-adjunct groups were presented the material
twice to equate for exposure time. A final examination
administered at the end of training demonstrated
that both adjunct types were significantly
superior in promoting learning gains over
non-adjunct materials (p < .01).


A  CAI  data  management  system  was  developed 
by  Ford  and  Slough  (1970)  for  an  electronics 
course  module.  The  course  was  tried  out  and 
revised  three  times  using  a  total  of  52  subjects. 
Next, the module was compared with normal classroom
training using 51 CAI subjects and 200
traditional  subjects.  Afterwards,  both  groups  took 
a  standard  school  examination  and  a 
supplementary  test.  For  all  ability  levels,  CAI 
produced  higher  achievement  than  traditional 
classroom  instruction.  In  addition,  CAI  produced 
time  savings  of  33  to  44  percent. 

Showel, Taylor, and Hood (1966) constructed a
leadership training package including tapes,
filmstrips, and workbooks. This training package was
used for an experimental group while a control
group received traditional instruction (i.e.,
lectures). The subjects were matched on the
General Technical Aptitude area of the Army
Classification Battery and randomly assigned to
control and experimental groups. An essay examination
was used to test achievement immediately
after training and 10 weeks after training. The
results demonstrated that the automated leadership
package produced greater gain and was less
costly than the conventional package.

Steadman, Bilinski, Coady, and Steinemann
(1969) were interested in investigating alternate
methods of training low-aptitude Naval personnel.
Of 31 subjects, half were taught by instructor and
half by programmed text. Achievement was
measured by three quizzes and a practical performance
test. Upon the termination of training, only
eight subjects reached an adequate proficiency
level in terms of the final practical performance
test. These writers concluded that, in general, the
course was not appropriate for low-aptitude
personnel.

Programmer  Characteristics 

The selection of programmers for programmed
learning is just as important as the selection of
materials. Some of the characteristics of successful
programmers are (a) “relatively high intelligence,”
(b) “interests in the area,” (c) “attitudes favorable
to the area and favorable to achieving the goal,”
(d) “compulsivity,” and (e) “functional level of
motivation (Melching, 1970, pp. 71-72).”

Television  Instruction 

TV instruction, although not used in the same
way as CAI, is much less costly. TV instruction
seems advantageous when instructor shortages
exist, rapid dissemination of information is
required, and student communication is not
necessary. This type of instruction is disadvantageous
when applied lessons and student
communication are needed.

Basic  Education 

Standlee and Hooprich (1962) feel that most
tests of the effects of adult reading courses lack
sophistication. Most experimenters measure reading
ability before and after training, but fail to
control for such factors as initial reading level,
intelligence, motivation, equivalence of forms, test
practice effects, set, test ceiling effects, change,
regression effects, timed tests, type of test score,
criterion choice, and differences between control
and experimental subjects. These authors, after
reviewing several sound studies, arrived at the
following conclusions:

1. Reading speed gains are real. What happens
to comprehension and vocabulary is uncertain,
since they are confounded with
speed. Eye movements usually improve.

2. Reading speed gains are retained. Generally,
60 to 70 percent was retained after six
months to two years.

3.  Reading  instruction  gains  transfer  to 
academic  achievement,  academic  aptitude, 
clerical  ability,  and  temperament.  These 
gains  may  not  be  due  to  reading  instruction, 
though,  because  these  courses  may  also 
teach  study  skills,  or  give  counselling  and 
therapy  which  can  also  be  associated  with 
improvement. 

4. No methods, materials, or programs of
instruction were shown to be superior to any
other. Also, no individual differences in
personality, intelligence, or occupation were
associated with reading skill gains.

5. Reading improvement courses are helpful for
those whose jobs depend upon reading. In
this case, increased speed is enough justification
for taking the course.

Steinemann, Hooprich, Archibald, and Van
Matre (1971) investigated the effects of a
“wordsmanship” course given to 176 low-aptitude
Naval personnel. These subjects characteristically
have low verbal aptitude and unfavorable language
attitudes which cause a bias against learning.
Nevertheless, these investigators found that “the
trainees substantially improved their knowledge
and proficiency in each of the sub-course areas of
wordsmanship, and most students reported a more
favorable attitude toward words and a desire for
self improvement of verbal skills.”


Mollenkopf (1969) gave different 100-hour
basic skills training courses (computation, spelling,
filing, reasoning, paragraph meaning) to three
different groups (office workers, laboratory
technicians, and production employees). Most of the
participants made sizable gains and most pretest
and posttest score differences were significant,
although regression and ceiling effects may have
been involved. In almost all of the tests, at least 80
percent of the students made gains.

Hooprich and Steinemann (1966) indicated
that there is “a general trend toward
performance-oriented training courses in which technical
mathematics and unnecessary electronics theory are
minimized. . . . Increasing investigative attention
devoted to performance evaluation problems is a
reflection of the growing recognition of performance
assessment as a critical factor in the final
evaluation of total training effectiveness (pp.
17-18).”

Kent, Bishop, Byrnes, Frankel, and Herzog
(1971a, 1971b) attempted to identify the Adult
Basic Education (ABE) courses that were successful
in job related settings (e.g., obtaining a job,
promotions, entering training). Information was
collected on 80 programs whose features or
aspects were typed. Fifteen programs containing
all features of interest were selected for the study.
Checklist interviews were used to obtain data. The
findings indicated that (a) there is a great need for
ABE in basic abilities which vary from student to
student and job market to job market; (b) the
need for job related ABE is not being met in that
the programs do not perform enough job placement,
skill training, post instructional followup of
students, self-evaluation, and improvement of
materials; (c) theory, administration, and money
are inadequate; (d) ABE programs should cooperate
among themselves and with large centers
for research; and (e) organizations should be
invited to bid in order to conduct ABE job related
programs.

Training  Devices 

Edgerton and Fryer (1950) have prepared a
system for preliminary evaluation of a training aid.
This system has the following features: (a) it is
uniform and consistent; (b) it is brief; (c) it needs
no special skills to administer; (d) it improves
validity of technical judgments; (e) it shows
advantages and defects of the training aid; (f) it
provides for an overall judgment; and (g) it yields
information from which an experimental evaluation
of the training aid can be constructed.

Richardson, Bellows, Henry & Co. (1962)
developed three evaluation forms for new training
devices. These forms were constructed from literature
reviews, descriptions of Navy devices, descriptions
of industrial devices, and evaluation reports.
These questionnaires were validated using the
nomination technique in which instructors and
training officers nominated devices as “best” or
“worst.” The resultant validity and reliability of
the three forms proved adequate for
use.

Siegel and Federman (1969) used Guilford’s
(1967) structure-of-intellect (SI) model to help
derive the most appropriate aids and devices for
training the tactical coordinator in the P-3C aircraft.
Guilford’s model allows the description of
the mental tasks an operator performs in terms of
intellectual load. These descriptions are quantitatively
derived, and the needed aids and devices can
be based upon them. The operations in the SI
model specify the type of aids or devices for training.
The contents in the SI model tell the subject
matter of the aids or devices. Finally, the SI
products tell what is to be learned. The authors
conclude that this technique defines training
requirements and closes “. . . the loop between
job analysis and the aid/device derivation.”

Instructor  Evaluation 

A. Harris (1969) has found “. . . differences
among teachers far more important than differences
between methods and materials in influencing
the reading achievement of children (p.
204).” The main criterion of teacher effectiveness
should be pupil gain on standardized tests. The
correlations between teacher ratings and tests are
not large enough to support the use of ratings.

Bittner (1968) recently executed an interesting
analysis of student evaluations of instructors.
Subjective comments were collected from students
on oral communication factors. These statements
were content analyzed by six speech teachers
(interrater reliability = .73). Five categories were
derived: (a) rate of speaking, (b) volume, tone, and
pitch, (c) use of audio-visual aids, (d) use of
discussion, and (e) organization of lecture. The
largest number of comments concerned organization
of lecture, while volume, tone, and pitch had
the smallest number of comments. The most
negative comments concerned volume, tone, and
pitch, and the most positive concerned use of
audio-visual aids. Rate of speaking was also somewhat
negatively appraised. In addition, more
negative comments were associated with graduate
teaching assistants than with any other category.


Veldman and Peck (1969) wished to determine
the influences on pupil evaluations of student
teachers. These authors felt that the most reliable
description of teacher behavior comes from the
students. The Pupil Observation Survey (POSR)
consisted of 38 items grouped into 10 scales.
POSR data were collected on 554 student teachers
at the University of Texas. The data were then
factor analyzed, yielding five factors: (a) friendly
and cheerful, (b) knowledgeable and poised, (c)
lively and interested, (d) firm control, and (e)
non-directive. Analysis of covariance was used to
determine if five characteristics (grade in student
teaching, grade of class, subject area, socioeconomic
status, level of school, and sex of
teacher) had any effects. The results demonstrated
that (a) all factors increased with increased student
teaching grade; (b) only friendly-cheerful and
lively-interested were positively and inversely
related to grade level of students; (c) all factors
except knowledgeable-poised were related to
subject matter area; (d) as social class decreased,
lively-interested increased, firm control decreased,
and non-directive increased; and (e) females were
rated higher on friendly-cheerful than males.

Hiller, Fisher, and Kaess (1969) performed a
computer investigation of the verbal characteristics
of effective classroom lecturing. Fifty-five
15-minute lectures producing 105,000 words were
analyzed for verbal fluency, optimal information
amount, knowledge structure cues, interest, and
vagueness. The findings demonstrated that vagueness
in the lecture was most important. Vagueness
is defined as “. . . the state of mind of a performer
who does not sufficiently command the
facts or the understanding required for maximally
effective communication (p. 670).”

Military  Research 

Electronics Technicians. Applied Psychological
Services (1971) recently developed a quick course
of passive sonar training for system technicians.
First, the training requirements were developed,
followed by a course which was balanced between
practical work and lecture presentation. Sonar
technicians were given the course in one week.
After finishing the course, they each completed a
13-item questionnaire. The mean value on a
four-point scale for all 13 questions was 3.4. High
values were concerned with the amount the
student learned in the course. The authors concluded
that this project was extremely useful,
since it demonstrated that quickly but systematically
developed courses could be useful.

Bilinski, Saylor, and Standlee (1969) used an
analysis of on-the-job feedback to help increase
training effectiveness. Electronics technician graduates
were examined in regard to their ability to
maintain a radar system. First, a job analysis was
performed; then a structured interview was constructed
from the job analysis to obtain
information from a fleet sample of electronics
technicians. This procedure elucidated difficult
maintenance and problem areas for feedback into
the training school.

Steinemann, Coady, Harrigan, and Matlock
(1968) wanted to evaluate the job capabilities and
fleet utilization of 64 four-year obligor graduates
of electronics technician phase A-1 training.
Performance measures and objective ratings were
collected. Most electronics technicians were found
to be more or less adequate. However, training
limitations made on-the-job training and initial
supervision necessary for all but the most routine
tasks. Troubleshooting was found to be the
weakest area. It was recommended that four-year
obligors be given more training, or only be allowed
to assist in fleet maintenance tasks. Steadman and
Harrigan (1971) obtained similar results with
six-year obligor data systems technicians. They
suggest deemphasis of irrelevant electronics theory
in favor of more practical training.

Helicopter  Training.  The  studies  discussed  in 
this  section  were  reviewed  in  a  previous  chapter  of 
this  report.  The  emphasis  then  was  on  dependent 
measures;  now  it  is  on  evaluation. 

Greer, Smith, and Hatfield (1967) wished to
control for checkpilot personal bias in rating
rotary wing students. The resultant ratings
reflected the checkpilot’s own standards rather
than the student’s flying skill. The training program
was analyzed into maneuver components.
Proficiency scales and instrument observation were
substituted for the checkpilot’s own method. The
Pilot Performance Description Record (PPDR) was
constructed to reflect the most critical aspects of
each maneuver. The PPDR was administered to 50
advanced and 50 intermediate students. The
results demonstrated that (a) reliability of flight
proficiency evaluation improved; (b) the PPDR
recorded specific student deficiencies; (c) checkpilots
who were trained in PPDR were more
consistent in their evaluation than checkpilots who
were only oriented in PPDR; and (d) checkpilot
training is necessary in the use of the PPDR.

Another approach, used by Greer (1968), to
compensate for the variations in checkpilot standards
involves grouping checkpilots with similar
standards. Checkpilots were asked to complete an
11-point rating form, and those who agreed at .90
or better were paired together. In their actual
evaluation duties, they correlated .65. It seems as
though the earlier approach (Greer et al., 1967) is
more fruitful, since their checkpilots became
better, less biased observers of behavior, while in
this latter study (Greer, 1968), the checkpilots’
bias is still allowed to operate.
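
The pairing rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the rating data and function names are hypothetical, and the Pearson coefficient is assumed as the agreement index, since the text says only that checkpilots who agreed at .90 or better were paired.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def agreeing_pairs(ratings, threshold=0.90):
    """Return rater pairs whose rating profiles correlate at or above threshold."""
    pairs = []
    for a, b in combinations(sorted(ratings), 2):
        if pearson(ratings[a], ratings[b]) >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical 11-point ratings given by three checkpilots to the same six items.
ratings = {
    "pilot_A": [2, 4, 6, 8, 9, 11],
    "pilot_B": [1, 4, 5, 8, 10, 11],
    "pilot_C": [9, 3, 7, 2, 6, 5],
}
print(agreeing_pairs(ratings))
```

Pairing on agreement in a rating form does not guarantee agreement in the field, which is consistent with the drop from .90 to .65 reported above.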

Duffy  (1968)  and  his  associates  (Duffy  & 
Anderson,  1968;  Duffy  &  Jolley,  1968)  produced 
an  objective  and  detailed  scoring  record.  Students 
were  scored  on  checkrides  during  and  after  train¬ 
ing  to  yield  a  class  percentage  error.  This 
procedure  allows  for  class  comparisons,  grade 
comparisons,  and  instructor  comparisons.  If  partic¬ 
ular  errors  are  identified  among  the  students  of 
one  instructor,  the  instructor  is  given  additional 
instructor training. Finally, if one checkpilot is
more  strict  than  the  others,  he  is  given  counsel  to 
make  his  observations  more  conforming. 
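
A percentage error score of this kind can be sketched as follows. The record layout, field names, and numbers are hypothetical and are not taken from the Duffy reports; the sketch only shows how one scoring record supports both class and instructor comparisons.

```python
from collections import defaultdict


def percent_error_by(records, key):
    """Percentage of scored items marked in error, grouped by `key`."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        errors[rec[key]] += rec["errors"]
        totals[rec[key]] += rec["items_scored"]
    return {group: 100.0 * errors[group] / totals[group] for group in totals}


# Hypothetical checkride records: one row per student checkride.
records = [
    {"class": "68-1", "instructor": "Smith", "errors": 6, "items_scored": 40},
    {"class": "68-1", "instructor": "Jones", "errors": 10, "items_scored": 40},
    {"class": "68-2", "instructor": "Smith", "errors": 4, "items_scored": 40},
    {"class": "68-2", "instructor": "Jones", "errors": 12, "items_scored": 40},
]

print(percent_error_by(records, "class"))
print(percent_error_by(records, "instructor"))
```

In this made-up data the two classes look identical, while the instructor breakdown shows a difference, which is the kind of contrast the procedure is meant to expose.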

Officer Training. Glickman and Vallance (1967)
wished to find those aspects of the OCS curriculum
which were most and least relevant to the
job requirements of ensigns on destroyers. One
thousand critical incidents were collected and
classified as to “taught” and “not taught.” Checklists
containing 100 of the resultant items were
sent to 30 to 50 high-level officers. They were
required to judge the length of time in service after
which  the  new  officer  should  be  able  to  handle  the 
incident.  The  sooner  an  ensign  was  expected  to 
cope  with  an  incident,  the  more  important  that  it 
be  learned  in  OCS.  Human  relations,  personnel 
administration,  and  leadership  skills  were  found  to 
be  more  important  in  this  context  than  technical 
skills. 
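
The ranking logic can be illustrated with a short sketch. The incidents, the judged times, and the use of the median as the summary statistic are all assumptions made for illustration; the source does not say how the judgments were combined.

```python
from statistics import median


def rank_incidents(judgments):
    """Order incidents from soonest to latest expected mastery (median months)."""
    medians = {incident: median(months) for incident, months in judgments.items()}
    return sorted(medians, key=medians.get)


# Hypothetical judgments: months of service after which a new ensign
# should be able to handle each incident, one value per judging officer.
judgments = {
    "counsel a subordinate": [1, 2, 2, 3],
    "stand engineering watch": [6, 9, 12, 12],
    "conduct division inspection": [3, 3, 4, 6],
}

for incident in rank_incidents(judgments):
    print(incident)
```

Incidents at the top of the list are those expected soonest and so, by the reasoning above, have the strongest claim on OCS curriculum time.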

Morsh  (1969)  administered  an  officer  manage¬ 
ment  inventory  to  10,242  Air  Force  officers  who 
ranged  in  rank  from  lieutenant  through  colonel. 
The  management  inventory  consists  of  a  listing  of 
tasks  and  duties,  and  a  listing  of  military  education 
topics.  The  officers  rated,  on  an  eight-point  scale, 
the extent to which each task is a part of their job,
and  the  extent  to  which  each  educational  topic  is 
useful  in  their  job.  Forty-three  managerial  types 
were  derived  from  this  analysis,  although  there  was 
much  overlap  across  types.  The  extent  of 
managerial  responsibility  was  directly  related  to 
officer  grade.  Also  identified  were  training  needs 
in  leadership,  communication,  creative  and  logical 
thinking,  problem  solving,  officer  ethics,  discipline 
and  morale,  and  military  customs  and  security. 
Other  training  topics  were  found  to  be  of  little 
use. 

Task Analytic Methods. Stewart (1970) used
task  analysis  to  evaluate  training  effectiveness. 
Military  task  data  were  collected  and  analyzed  to 
determine the extent to which it is job oriented.
Stewart  found  that,  in  terms  of  cost,  overtraining 
was  as  significant  a  problem  as  undertraining. 

Siegel  and  Schultz  (1961)  and  Siegel,  Schultz, 
and  Federman  (1961)  designed  a  system  of  train¬ 
ing  evaluation  using  matrix  concepts.  Essentially, 
training  is  acceptable  if  the  average  trainee 
performs  with  proficiency  on  a  highly  important 
task.  Training  is  poor  if  the  average  worker 
performs  poorly  on  a  very  important  task  and  is 
very  proficient  on  a  task  of  low  importance.  This 
technique  can  yield  a  training  index,  an  overtrain¬ 
ing  index,  and  an  undertraining  index  for  the 
entire  training  program.  In  addition,  this  method 
points  to  deficiencies  in  the  program  which  need 
emphasis  and  parts  of  the  program  which  need 
deemphasis.  Schultz  and  Siegel  (1962a,  1962b) 
applied  the  technique  to  posttraining  performance 
of  four  Naval  ratings.  The  results  demonstrated 
that  none  of  the  groups  were  undertrained,  while 
two  of  the  groups  seemed  overtrained. 
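
The matrix idea can be sketched as an importance-by-proficiency classification. The scales, the cut score, and the index definitions below are assumptions made for illustration; they are not the formulas of Siegel and Schultz, which are not given in this review.

```python
def classify(importance, proficiency, cut=3):
    """Place one task in a cell of an importance-by-proficiency matrix.

    The 1-5 scales and the cut score are illustrative assumptions.
    """
    if importance >= cut and proficiency < cut:
        return "undertrained"
    if importance < cut and proficiency >= cut:
        return "overtrained"
    return "matched"


def training_indices(tasks, cut=3):
    """Proportion of tasks falling in each cell type (hypothetical indices)."""
    counts = {"matched": 0, "overtrained": 0, "undertrained": 0}
    for importance, proficiency in tasks.values():
        counts[classify(importance, proficiency, cut)] += 1
    n = len(tasks)
    return {cell: count / n for cell, count in counts.items()}


# Hypothetical (importance, mean proficiency) values for four tasks.
tasks = {
    "troubleshoot receiver": (5, 2),
    "align antenna": (4, 4),
    "polish casing": (1, 5),
    "log maintenance": (2, 2),
}
print(training_indices(tasks))
```

Tasks classified as undertrained point to parts of a program needing emphasis, and tasks classified as overtrained point to parts that could be deemphasized.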

Aircraft Recognition. Whitmore, Cox, and Friel
(1968)  performed  a  study  concerned  with  ground 
to  air  recognition  training.  The  original  training 
program  for  this  aspect  of  aircraft  recognition  was 
thought  to  be  inadequate.  First,  ground  to  air 
recognition  slides  were  selected  (16  Soviet  and 
American jet fighter/attack aircraft). The paired-comparison
method was employed to train in the
discrimination.  Eight-second  exposures  were  given 
during  training  while  five-second  exposures  were 
selected  for  the  test.  The  results  demonstrated  that 
(a)  16  sessions  were  needed  to  achieve  a  95  per¬ 
cent  average  recognition  level;  (b)  class  average  on 
degraded  images  was  61  percent;  (c)  degraded 
images  correlated  .82  with  the  training  achieve¬ 
ment  tests,  indicating  that  the  skill  learned  during 
training  was  not  specific  to  the  training  slides;  and 
(d)  trainees  maintained  approximately  the  same 
position  in  class  from  achievement  test  to  achieve¬ 
ment  test. 

Summary 

This  chapter  began  with  a  discussion  of  some 
generally  recognized  trends.  The  most  important 
trend  seemed  to  be  increased  recognition  of  the 
multidimensionality  of  criterion  measures.  Next, 
there  was  a  discussion  of  training  needs  and 
deficiencies  followed  by  a  very  critical  discussion 
of  trends  in  cross-cultural  training.  This  was 
followed by a presentation of some studies concerned
with achievement measures as predictors of
later success. Then there were reviews of studies
involving sensitivity training, programmed
instruction, CAI instruction, basic education,
training  and  evaluation,  and  instructor  evaluation. 
The  final  portion  of  this  chapter  was  devoted  to 
recent  military  research  including  electronics  tech¬ 
nician  training,  helicopter  training,  officer  training, 
task  analytic  methods  of  evaluation,  and  aircraft 
recognition. 


VI.  COMPARATIVE  EVALUATION 

This  chapter  is  divided  into  two  parts.  The  first 
section  involves  comparative  evaluation  studies  of 
non-low-aptitude  men,  while  the  second  section 
focuses on low-aptitude evaluations. Generally, the
studies reported here involve a relative comparison
between two or more methods of instruction or
training. In many cases, a new training method is
compared with a standard method to determine if
the latter should be replaced by the former.

Comparative  Studies  of  Subjects  Within  Average 
or  Higher  Aptitude  Ranges 

Steinemann, Coady, Harrigan, Matlock, and
Steadman (1969) compared six-year obligor
electronics technicians with four-year obligors who
are given less training. Six-year obligors were
found to perform better on troubleshooting tests,
test equipment examinations, written theory, and
equipment tests. Questionnaire data on school
limitations in troubleshooting were verified by the
relative weakness found in this area as indicated by
performance tests.

Hurlock  (1971)  grouped  electronics  technician 
training  objectives  into  four  short  CAI  lessons. 
Fifty  randomly  selected  students  were  given  CAI, 
and  180  were  given  traditional  training.  All 
subjects  took  the  same  final  examination.  The 
results  demonstrated  that  overall  achievement  was 
10  percent  higher  for  CAI  students.  In  addition, 
CAI instruction reduced training time 48.5 percent
(17 hours to 8 3/4 hours).
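
The reported reduction follows directly from the two training times, as this small check shows. The function name is illustrative only.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# Training time fell from 17 hours to 8 3/4 hours.
print(round(percent_reduction(17, 8.75), 1))
```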

Askren and Valentine (1970) were interested in
the differences between Air Force instructors with
job experience and without job experience in
teaching a specialty area. The criteria used were
student grades, student critiques, and supervisory
evaluation. Seventy instructors and 585 students
were used as subjects. Their conclusions were that
(a) there were no significant differences in overall
course grades across instructor type in a
pneudraulics course; (b) there was an interaction
for an environmental system course such that
grades of students from field-experienced teachers
increased from the beginning to the end of the
course and decreased for non-field-experienced
teachers from the beginning to the end of the
course; (c) there were no significant differences in
the student critiques; (d) field-experienced
teachers were given an average supervisory rating
of 3.22 (on a five-point scale) while non-field-experienced
instructors received an average rating
of 3.06; (e) a small number of the rating
categories (knowledge of subject, student interest,
and student participation) caused most of the
difference; and (f) the job-experienced instructors
were better at teaching theory. These investigators
concluded  that  there  is  little  practical  difference  in 
instructor  type,  but,  if  a  shortage  of  field- 
experienced  instructors  exists,  field-experienced 
persons  should  be  used  in  practical,  shop  related 
courses. 

Tallmadge (1968) attempted to study the interactions
between trainee characteristics (e.g., aptitudes
and interests) and training methods. A one-week
segment of Navy radarman school
was used as the setting for this experiment. In
addition, a 32-item criterion test was developed.
Three experimental conditions were involved: (a)
subjects taught using rote memorization methods;
(b) subjects taught problem solving, principles, and
rationale approach; and (c) a standard approach,
which is a mixture of the other two methods. The 16
aptitude  and  interest  measures  did  not  interact 
with  the  three  training  methods  as  hypothesized. 
Perhaps  the  wrong  training  methods  or  the  wrong 
aptitude  and  interest  measures  were  used.  It  is  also 
possible  that  other  interactions  existed  which 
obscured  the  hypothesized  interactions.  Subjects 
in  the  rationale  and  understanding  condition 
performed significantly better on the criterion test
than  the  others,  thus  supporting  the  contention 
that  this  approach  results  in  a  hierarchically  higher 
type  of  learning  with  better  retention. 

McFann, Buchanan, Lyons, Ward, and Waits
(1958) compared a conventional Known Distance
marksmanship training course with a new Trainfire
I rifle marksmanship course. After four weeks of
training, both groups received target detection and
the Trainfire I marksmanship proficiency tests, as
well as the conventional Known Distance test. The
results demonstrated that Trainfire I training
produced (a) a greater number of detected targets;
(b) a shorter latency of target detection; (c) more
target hits; (d) a higher percentage of men qualifying
(the sum of marksman, sharpshooter, or
expert); and (e) fewer qualifying as expert on the
Known Distance range.


Olmstead (1968) compared Quick Kill Basic
Rifle Marksmanship training (QKBRM) with traditional
Basic Rifle Marksmanship training (BRM).
QKBRM involves training the student to engage a
target  without  aligning  the  sights  of  the  weapon. 
Two  experimental  groups  received  QKBRM  in 
their  training  and  one  control  group  received  tradi¬ 
tional  BRM  training  (total  N  =  824).  One  of  the 
experimental  groups  received  a  pre-training  and  a 
post-training  questionnaire,  and  the  other  experi¬ 
mental  group  received  only  a  post-training 
questionnaire.  Control  and  experimental  groups 
were  compared  on  gains  in  confidence,  attitude 
toward  BRM,  and  drill  sergeant  attitudes  toward 
QKBRM.  Findings  indicated  an  increase  in  con¬ 
fidence  in  both  groups  with  QKBRM  trainees 
gaining  more  confidence  than  traditional  BRM 
trainees.  The  drill  sergeant’s  attitude,  though,  was 
only  somewhat  favorable.  One  undeniable  method¬ 
ological  weakness  in  this  study  is  that  the  authors 
did  not  report  any  proficiency  or  marksmanship 
data  across  experimental  groups. 

Another  study  in  this  group  concerns  the 
effectiveness  of  an  apparatus  used  as  a  simulator  in 
driver  training.  The  simulator-trained  group  was 
found  to  be  superior  in  this  experiment  to  the 
group  trained  on  a  projection-type  driver  trainer 
(Jeantheau & Anderson, 1966).

Caro and Isley (1966) used four groups of 33
subjects each in a study of Naval helicopter flight
training. Groups A and B flew a training device
3.17 and 7.13 hours, respectively. Two control
groups, C and C', received no device training. The
Fisher exact probability test demonstrated that
both device groups had fewer eliminations from
training than did both control groups (10 percent
to 30 percent at p<.006). In addition, the control
groups had more unsatisfactory and below-average
grades than did the two experimental groups.
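
For readers unfamiliar with the Fisher exact probability test, the following is a minimal one-sided version built from the hypergeometric distribution. The cell counts in the usage lines are hypothetical values chosen only to be roughly consistent with the reported 10 and 30 percent elimination rates; the study's actual counts are not given here, so the reported probability should not be expected to be reproduced.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability, under fixed margins, of a count in the
    top-left cell as small as `a` or smaller.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)
    lowest = max(0, col1 - row2)
    return sum(
        comb(row1, k) * comb(row2, col1 - k) for k in range(lowest, a + 1)
    ) / denom


# Hypothetical counts: 66 device trainees with 7 eliminated,
# 66 control trainees with 20 eliminated.
p_value = fisher_exact_one_sided(7, 59, 20, 46)
print(f"one-sided p = {p_value:.4f}")
```

The exact test is a natural choice when some cells of the table are small, since it does not rely on a large-sample approximation.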

In another study, Isley, Caro, and Jolley (1968)
examined  the  advantage  of  a  modified  fixed  wing 
device  as  a  synthetic  trainer  for  rotary  wing  proce¬ 
dures  and  aircraft  control.  Three  groups  of  trainees 
were  used  each  with  0,  10,  and  20  hours,  respec¬ 
tively,  of  synthetic  training  time.  The  experi¬ 
menters  found  no  difference  in  time  to  complete 
the  course  or  in  helicopter  flight  performance. 

Isley (1968) and Isley and Caro (1969), in
similar  studies,  examined  the  effects  of  a  fixed 
wing  rotary  aircraft  instrument  trainer.  Warrant 
officer  candidates  were  divided  into  three 
treatments  with  0,  10,  and  20  hours,  respectively, 
of  synthetic  training.  The  criteria  used  were  devia¬ 
tions  from  regulation  on  10  flight  parameters  in  a 
checkride. The results dramatically favored the
group  with  no  synthetic  training  in  that  they 
performed  as  well  or  better  than  the  20-hour 
group.  The  authors  of  this  study  seriously 
questioned  use  of  the  simulator. 

Rhodes (1950) attempted to compare a new
and an old ejection-seat trainer. The new trainer
was  more  mobile,  not  as  high,  and  more  realistic  in 
that  it  had  a  dummy  cockpit.  Training  consisted  of 
film,  a  lecture,  and  an  ejection.  Attitude  was 
measured  in  both  an  “old”  and  a  “new”  group 
before  and  after  ejection  on  each  device.  A  group 
of  reserve  pilots  was  used  as  a  control.  No  differ¬ 
ences  were  found  across  groups;  therefore,  each  is 
regarded  as  equally  effective.  Attitude  did  improve 
for  both  groups  combined  with  reference  to  gain 
scores (p<.01). The author concluded that, regardless
of device, overall ejection-seat training tends
to  increase  confidence  and  decrease  fear  of  this 
bailout  method. 

Gabriel  and  Burrows  (1968)  performed  a  study 
of  pilot  time-sharing  performance.  Time-sharing  is 
concerned  with  alternating  attention  between  two 
or  more  sources  of  information.  Specifically,  the 
pilot  uses  his  instrument  panel  so  much  that  he  has 
little  time  to  devote  to  outside  scanning  of  the 
environment.  The  training  task  in  this  study  was  to 
improve  the  perception  of  midair  threats  of 
collision.  The  results  suggested  that  use  of  the 
simulator  can  increase  efficiency  of  pilot  time¬ 
sharing  between  intra-  and  extra-cockpit  stimuli. 

Ward,  Books,  Kern,  and  McDonald  (1970) 
wished  to  determine  if  the  Basic  Combat  Training 
(BCT) and the Advanced Infantry Training (AIT)
courses could be integrated for a sample of conscientious
objectors in medical corpsman training.
The  content  of  the  training  courses  currently  used 
was  catalogued.  A  job  activities  questionnaire  was 
developed  reflecting  emergency  medical  care  and 
secondary  and  recuperative  treatment.  The  four 
types  of  tasks  included  in  the  training  were 
company  aidman,  evacuation  medic,  aid-station 
dispensary  medic,  and  ward  nursing  care  medic. 
The  criteria  for  selecting  these  groupings  were 
availability of supervision, frequency, and opportunity
for on-the-job training. In the resultant
16-week course, practical work was emphasized
and lecture was deemphasized. A large amount of
TV  instruction  was  used  for  80  experimental 
students.  For  80  other  students,  traditional  train¬ 
ing  was  involved.  Combat  proficiency,  aidman 
proficiency,  and  attitude  questionnaires  were 
administered to all the trainees. In addition, an
evaluation questionnaire was given to the instructors.
The results of this effort demonstrated that
(a) on military proficiency tests, both
experimental and control groups performed
equally well; (b) control subjects performed better
on the Basic Combat Proficiency Test; (c) experimental
subjects did better on physical skills used
by medical corpsmen; (d) there were no significant
differences in written knowledge tests; (e) experimental
subjects performed better on medical
performance tests; (f) experimental subjects had a
higher opinion of the Army and its training than
did standard subjects; and (g) instructors thought
the experimental program was superior.

Judisch, Cooper, Francis, and Ray (1968) investigated
the present curricula and job requirements
of graduating medical corpsmen from two
schools.  They  found  that  on  knowledge  tests  San 
Diego  students  performed  better  on  anatomy, 
physiology,  first  aid,  and  nuclear  biological  and 
chemical  warfare.  On  the  other  hand,  Great  Lakes 
students  were  superior  in  patient  care.  A  perform¬ 
ance  decrement  was  found  over  time  such  that,  24 
weeks  post-training,  graduates  were  10  percent 
worse  than  current  students,  and  graduates  of  over 
24  weeks  were  16  percent  worse.  Also,  a  survey 
was  performed  to  determine  how  much  and  where 
prior  knowledge  and  information  were  acquired. 
Students  reported  gaining  prior  knowledge  from 
lectures,  films,  readings,  practical  experience,  and 
other  visual  aids.  In  all,  though,  this  knowledge 
accounted  for  only  10  percent  of  the  school 
knowledge.  It  was  also  found  that  San  Diego 
students  learned  more  from  lectures  than  did 
Great  Lakes  students,  and  that  Great  Lakes 
students  learned  more  from  reality  than  did  San 
Diego  students.  As  a  consequence  of  these  results, 
the  authors  recommended  revision  in  the  cur¬ 
riculum. 

Richlin,  Federman,  and  Siegel  (1958)  compared 
general  Naval  technical  training  with  a  more 
specialized  type  of  training  under  the  Selective 
Emergency  Service  Rate  Program  (SESR).  Each 
Naval  rating  in  this  program  is  subdivided  and 
given  a  more  specialized,  shorter  type  of  training. 
After training, the men are utilized mostly in tasks
for  which  they  were  trained.  A  Technical  Behavior 
Checklist  (TBCL)  was  developed  as  a  criterion  of 
performance  for  aviation  machinist  mates  in  the 
SESR  program.  Items  for  the  TBCL  were  derived 
from  tasks  selected  for  their  importance  to  the 
job,  time  consumed,  and  variability.  The  results  of 
this  study  demonstrated  that  graduates  of  the 
SESR program were equal to or better than the
graduates  of  the  more  generalized  program.  Several 
other  SESR  studies  were  performed.  In  these 
studies  it  was  demonstrated  that  (a)  SESR  trained 
air  controllers  performed  as  well  as  generally 
trained  air  controllers  except  in  tower  operations 
(Siegel, Richlin, & Federman, 1958); (b) SESR
trained  parachute  riggers  performed  as  well  or 
better  than  generally  trained  parachute  riggers 
(Siegel,  Richlin,  &  Federman,  1958);  and  (c)  SESR 
trained  avionics  technicians  performed  as  well  or 
better  than  pre-SESR  trained  avionics  technicians 
(Richlin, Siegel, & Schultz, 1960).

Siegel,  Federman,  and  Richlin  (1959)  adminis¬ 
tered  a  series  of  interviews  to  officers  and  petty 
officers  in  order  to  assess  their  opinion  of  the 
SESR  program.  One  problem  identified  was  the 
difficulty  of  assigning  tasks  to  a  more  specialized 
man.  Some  supervisors  felt  SESR  trained  graduates 
achieved  competence  earlier,  but  that  the  more 
generally  trained  men  were  more  useful. 

CAI and TV Instruction. Gallagher (1970)
attempted to investigate relevant learner characteristics
and optimal types of instruction. He used
four treatments: (a) computer assigned sequence
of instruction, instructor evaluated product; (b)
computer assigned sequence of instruction,
computer evaluated product; (c) student selected
sequence, instructor evaluated product; and (d)
student selected sequence, computer evaluated
product. Separate analyses of variance were
conducted  on  the  emergent  data  for  four  depen¬ 
dent  variables:  midterm  examination,  final 
product  score,  terminal  or  system  time  use,  and 
time  to  complete  cognitive  portion  of  task.  The 
results indicated that (a) there were no significant
effects on any of the dependent measures; (b)
both self-sequenced groups achieved superior
performance on three of four dependent measures;
(c) the computer assigned sequence of instruction
was best in terms of cost; (d) those who performed
best on the dependent measures were enthusiastic
about the computer presentation; and (e) individual
differences were minimized in the computer
evaluated group. In conclusion, specific
learner characteristics were related to success, and
the student selected, computer evaluated
approach was best in terms of costs.

Fishman,  Keller,  and  Atkinson  (1968)  used  CAI 
to  present  spelling  drills  to  29  fifth-grade  students. 
Some  words  were  presented  via  distributed 
practice,  and  other  words  were  presented  with 
massed  practice.  The  results  demonstrated  that  at 
the  end  of  training  the  massed  trials  produced 
more  correct  responses,  but  10  and  20  days  later, 
the  distributed  practice  group  was  superior 
(p<.025).


In  another  study,  Rawls  and  Rawls  (1968) 
found  no  significant  differences  in  achievement 
and  retention  between  conventional  lecture  pres¬ 
entation  and  closed  circuit  TV.  College  students, 
though,  regarded  the  TV  instruction  unfavorably 
and  preferred  classroom  instruction.  This  was  true 
even  among  those  who  achieved  high  grades  or  had 
previous  TV  courses.  The  students  were  observed 
looking  at  the  TV  set  only  20  percent  of  the  time, 
while  they  looked  at  the  lecturer  42  percent  of  the 
time. 

Fidelity.  Grimsley  (1969a,  1969b)  proposed  to 
study  the  effects  of  variations  in  fidelity  upon 
acquisition, transfer, and retention in group training
procedures. There were 12 trainees per condition,
trained in groups of four on the Nike-Hercules
missile. They used a real (electric), a cold
(non-electric),  or  an  artist’s  sketch  of  the  control 
panel.  The  subjects  were  tested  immediately  after 
training,  four  weeks  later,  and  six  weeks  later  on 
the  92-step  missile  firing  procedure.  No  differences 
were  found  in  training  time,  post-training  perform¬ 
ance,  performance  after  four  and  six  weeks,  and  in 
retraining  time  (after  six  weeks).  This  study 
suggests  that  a  considerable  saving  of  costs  can  be 
achieved  by  using  a  low-fidelity  device.  Similar 
results  were  found  by  Grimsley  (1969a,  1969b)  in 
a  study  that  was  identical  except  that  group  train¬ 
ing  procedures  were  not  used. 

Reduced  Training  Time.  Longo  and  Mayo 
(1967) wished to determine if the 19-week airborne
electronics training course could be
decreased  in  time  to  14  weeks.  Two  matched 
samples  of  trainees  were  used  (total  N  =  308).  The 
results  proved  disappointing  since  students  in  the 
longer  course  performed  better  than  students  in 
the  shorter  course. 

Johnson  and  Salop  (1968)  observed  that  regular 
track  avionics  fundamentals  training  requires  16 
weeks  while  accelerated  track  training  needs  only 
10  weeks.  The  accelerated  course  differs  from  the 
standard  course  only  in  speed  and  amount  of 
redundancy.  In  addition,  only  students  of  high 
ability  are  assigned  to  the  accelerated  track.  It  was 
found  after  training  that  accelerated  students 
scored  2.6  points  below  students  of  the  same 
ability  on  the  single  track  program,  but  5.9  points 
higher  than  all  one  track  students,  and  20.8  points 
higher  than  that  required  to  graduate.  The  authors 
estimated  that  use  of  accelerated  training  in 
avionics  fundamentals  can  save  $750,000  a  year. 

Valverde (1969) decided to apply a systems
approach  to  electronics  maintenance  training. 
First,  behavior  descriptions  were  derived  from  task 
analysis  of  the  job  requirements  followed  by  the 
construction  of  performance  tests  based  on  the 
objectives. Then a 14-week experimental training
course  was  constructed  for  subjects  with 
electronics  aptitude  scores  ranging  from  the  60th 
to  the  95th  percentile.  This  group  received  only 
enough  electronics  theory  to  do  the  job.  Another 
group  with  aptitude  scores  of  80  or  better  received 
the  traditional  24-week  course  including  10  weeks 
of  electronics  principles.  The  experimental  group 
was  divided  into  two  groups:  60th  to  75th  per¬ 
centile  and  80th  to  95th  percentile.  The  results 
demonstrated  that  (a)  the  high-aptitude  experi¬ 
mental  group  performed  better  on  the  perform¬ 
ance  test  than  the  medium-aptitude  experimental 
group, which performed better than the traditionally
trained control group; (b) the control
group  scored  better  on  special  theory  and  job 
knowledge  tests;  and  (c)  the  cost  of  the  experi¬ 
mental  program  was  less  than  the  cost  of  the  tradi¬ 
tional  program. 

Mental  Health.  Kumpan  (1965)  was  interested 
in the effect of training on psychiatric aides in a
mental hospital. The trainees consisted of 48
experimental subjects taking a four-month training
program and 48 control subjects. There were two
experimental wards of 30 patients each with the
48 experimental aides rotating among them.
Kumpan found that the patients in the experimental
wards did, indeed, improve. Psychiatric
aides usually have the most contact with patients,
but  they  are  ill-qualified  to  help  them  because 
they  do  not  understand  the  causes  of  mental 
illness. 

Cochran  and  Steiner  (1966)  used  an  experi¬ 
mental  group  of  58  attendants  for  the  retarded. 
They  were  given  the  Southern  Regional  Education 
Board  Test  before  and  after  training.  Sixteen 
control  attendants  were  also  used  to  determine  if 
testing  itself  can  cause  a  gain  in  posttest  scores 
without  training.  Indeed,  the  control  subjects 
gained 5.18 points (p<.01), while the experimental
subjects gained 26.8 points (p<.001). Also,
younger  subjects  with  the  least  tenure  seemed  to 
make  the  greatest  gains. 

Poser  (1966)  performed  an  experiment  to 
answer  the  question  of  whether  special  academic 
or  intellectual  knowledge  is  required  to  perform 
group  therapy  with  schizophrenics.  The  three 
experimental  conditions  involved  (a)  45  patients 
treated  by  psychiatrists  and  trained  social  workers, 
(b) 87 patients treated by students without any
training,  and  (c)  63  untreated  controls.  All 
patients,  before  and  after  therapy,  were  given 
several  tests  to  differentiate  psychotic  from 
normal,  including  tapping  speed,  reaction  time, 
digit symbol, color-word conflict, verbal fluency,
and the Verdun Association List. Analysis of
covariance  was  performed  on  the  data.  The  results 
indicated  that  (a)  four  of  six  tests  showed  signifi¬ 
cant  gains  by  the  lay  therapist  group  as  compared 
with  the  untreated  groups;  (b)  two  of  six  tests 
showed  significant  gains  as  the  result  of  therapy  by 
the  professional  therapist;  and  (c)  three  of  six  tests 
showed  significant  gains  by  the  lay  therapists  over 
the  professional  therapists. 

The  conclusion  from  this  experiment  would 
seem  to  be  that  the  use  of  lay  therapists  produced 
greater improvement than did professionally
trained therapists. Of course, this involved only
group  therapy  and  not  the  traditional  one-to-one 
situation  in  which  a  professional  is  most  certainly 
needed. 

Leadership  Training.  Rittenhouse  (1953) 
compared  two  samples  of  enlisted  men,  one  of 
which  attended  noncommissioned  officer  (NCO) 
leadership  school.  Both  groups  were  compared  on 
rank,  assignment,  and  awards.  The  school  group 
seemed to have a higher final rank and the nonschool
group had a greater gain in rank, but these
differences were not statistically significant. The
school graduate group had more infantry assignments
(47.2 percent versus 36.7 percent). Also, a
greater  proportion  of  the  school  graduate  group 
received  combat  infantry  badges. 

Hood,  Showel,  and  Stewart  (1967)  contrasted 
three  methods  of  NCO  leadership  training  with  a 
non-training group. The trained leaders demonstrated
(a) higher evaluations, (b) greater esprit de
corps among their subordinates, (c) better proficiency
test performance, (d) better preparation,
briefing, and control of their men, and (e) more
frequent structuring and use of rewards and definitions.

Barrett  (1965)  attempted  to  measure  the 
impact  of  a  90-hour  executive  training  program  of 
the  City  of  New  York  through  comparison  with  a 
control  group  which  did  not  undergo  training 
(total  N  =  255).  The  results  demonstrated  no 
differences  across  groups  in  before-  and  after- 
performance  ratings  by  peers  and  supervisors.  The 
only measurable changes were increases in consideration
and in initiating structure in the trainees
and a decreased critical attitude toward subordinates.

Armor Training. The Human Resources Research
Organization (Baker, Cook, Warnick, &
Robinson, 1964) developed and evaluated a
system for conducting tactical training of tank
platoon crews. The tank crews themselves were
trained on a miniature battlefield with radio-controlled
tanks and simulated terrain. The tank
commanders  were  trained  on  the  Army  Combat 
Decisions  Game  using  tank  models  on  a  terrain 
board. A field performance test was then administered
to the experimentally trained crews and to a
group of matched controls. The crews receiving
experimental  training  obtained  significantly  higher 
scores  than  the  matched  control  crews. 

Olson  and  Baerman  (1955)  wished  to  determine 
if  a  brief  course  in  gas  conservation  had  any  effects 
on  fuel  consumption  in  the  M48  tank.  The  three 
experimental conditions were (a) control, rotated
among tanks in unit; (b) control, kept own tank;
and (c) experimental, received instruction in fuel
economy. These researchers found that the experimental
group used less fuel when considerable
stop-and-go driving was involved.

Reading  and  Verbal  Instruction.  Seventy-two 
scientists  and  engineers  were  trained  for  reading 
using  a  book  method,  and  42  were  trained  using 
mechanical  machines  (Jones  &  Carran,  1965). 
Different  forms  of  the  Diagnostic  Reading  Test 
were  given  before  and  after  training.  All  subjects 
were found to have gained significantly after training,
but in a followup 18 months later, the book
approach was shown to be superior. In fact,
performance of the machine trained group actually
decreased after the time period, while performance
of the book trained group continued to increase
(p<.002).

Kelley  and  Mech  (1967)  wished  to  ascertain  if  a 
reading  laboratory  course  could  produce  an 
increase  in  grade  point  average  among  college 
students.  Twenty-three  experimental  subjects  were 
matched  with  23  controls.  After  three  semesters 
no  significant  differences  in  grade  point  average 
were  found.  The  investigators  then  divided  their 
experimental  and  control  groups  by  academic 
major. They found that (a) among education
majors there was a statistically significant difference
after three semesters (p<.025); (b) there was
also a statistically significant difference among
science and mathematics students (p<.01); and (c)
there were no significant differences among social
studies and literature majors. Perhaps the education,
science, and mathematics majors had an
initially  greater  decrement  in  verbal  ability,  leaving 
a  great  deal  more  room  for  improvement.  Also, 
education  majors  may  have  had  a  greater  interest 
in  reading  improvement. 

Frase  (1969)  taught  48  undergraduates  verbal 
materials using two different methods of presentation.
One method used a horizontal display of
associations while the other used a vertical tabular
display of associations. The results showed that
the horizontal method yielded superior learning,
yet  the  subjects  preferred  the  vertical  tabular 
display. 

Comparative  Studies  of 
Low-Aptitude  Subjects 

Skill Acquisition. Van Matre and Steinemann
(1966) trained 26 low-aptitude men in an electronics
technician course in a shorter period of
time and gave them skills more immediately useful
on the job. This group was compared with 24
conventionally trained personnel in a fleet follow-up
using performance tests, ratings, interviews, and
written  tests.  The  results  demonstrated  that  the 
performance  of  the  experimental  group  was 
adequate  and  not  significantly  different  from  the 
conventional  group  in  proficiency. 

Van  Matre  and  Harrigan  (1970)  compared  the 
performance  of  54  marginally  qualified  electrical 
technicians with 51 well-qualified electrical technicians
who underwent training. These groups
were compared after they were on the job in the
fleet for 24 months. A rating scale and a structured
interview score were used as criteria. The
conventionally trained men were rated as more
capable in troubleshooting and use of test equipment,
but were not generally rated differently
from  low-aptitude  men.  In  fact,  the  lowest  ratings 
obtained  by  low-aptitude  men  were  average. 

Mayo (1969) administered an aviation structural
mechanic course to 30 Category IV personnel,
i.e., the lowest 30 percent on the Armed
Forces Qualification Test (AFQT). The fleet
performance of this group was then compared
with that of personnel who scored above the 30th
percentile. Among the low-aptitude men, performance
varied from highly satisfactory to unsatisfactory
with no way of predicting which men
would perform adequately. Low-aptitude men
were found to have lower ratings (p<.05) than the
other groups. Based on these results, Mayo
suggested that Category IV personnel should not
be used for this Naval rating unless there is a manpower
shortage. It is noted, however, that the
comparison group was given 25 percent more
training and that ratings were used as criteria
rather than performance tests.

Hooprich  (1968)  wished  to  determine  the 
appropriateness  of  commissaryman  training  for 
Category  IV  personnel.  The  results,  based  on  two 
studies,  demonstrated  that  (a)  31  of  35  Category 
IV  subjects  successfully  completed  training, 
regardless of their low reading ability, although
their grades were significantly lower than the
comparison group; (b) Category IV subjects
needed  to  devote  more  outside  time  to  study,  and 
they  required  more  time  from  instructors  to  meet 
criterion;  (c)  the  differences  across  groups  were 
most  evident  on  paper-and-pencil  tests  and  least 
evident  on  actual  performance  tests;  (d)  AFQT 
scores  failed  to  predict  school  performance;  and 
(e)  reading  test  scores  were  significantly  correlated 
with  some  aspects  of  performance. 

Standlee and Saylor (1969) performed an
equipment operator training study with Category
IV subjects. The performance of six Category IV
subjects was compared with 16 subjects who were
not so classified. Then, the AFQT scores for this
group and for commissaryman training were
combined to determine if AFQT score predicted
performance. It was found that (a) all Category IV
subjects passed the course; (b) scores of the Category
IV subjects were lower, especially on written
tests as opposed to the more practical performance
tests; (c) AFQT scores were unrelated to achievement;
(d) mathematics was a source of trouble for
Category IV personnel; and (e) Category IV men
needed more individual attention and counselling.

Fox, Taylor, and Caylor (1969) compared the
performance of low-aptitude men with higher aptitude
men on several training tasks: visual monitoring,
rifle assembly, missile preparation, phonetic
alphabet,  map  plotting,  and  combat  plotting. 
Low-aptitude  groups  needed  2  to  4  times  as  much 
training  time,  2  to  5  times  more  training  trials,  and 
2  to  6  times  as  much  prompting  to  reach  criterion. 
Middle-aptitude  group  performance  was  found  to 
be  more  like  that  of  the  high-aptitude  group  than 
the  low-aptitude  group.  The  authors  concluded 
that  individual  differences  in  aptitude  must  be 
recognized,  and  training  programs  must  be 
designed  to  account  for  these  differences. 

Grunzke, Guinn, and Stauffer (1970) evaluated
the performance of 26,915 low-aptitude men who
were  taken  into  the  Air  Force  even  though  they 
were  below  the  minimum  acceptable  level.  The 
findings  demonstrated  that  the  low-aptitude  men, 
as  compared  with  subjects  with  higher  aptitude, 
had  (a)  a  smaller  percentage  completing  basic 
training,  (b)  more  disciplinary  problems,  (c)  more 
unsuitable discharges, and (d) a lower percentage
attaining  skill  level.  In  addition,  among  low- 
aptitude  men,  high  school  graduates  and  whites 
performed  better  than  high  school  non-graduates 
and  Negroes. 

In  another  study,  a  manpower  training  program 
was surveyed by comparing 1,062 program graduates
with 444 program dropouts (Trooboff,
1968). The results showed that 84 percent of the
graduates received employment while only 67
percent of the dropouts received employment.
Also, the average earnings of graduates increased
from $.98 to $1.76 (79 percent), while the average
earnings of dropouts increased from $1.07 to
$1.51 (29 percent). Even though several factors
were  left  uncontrolled,  the  author  concluded  that 
the  program  was  successful. 

Individualized  Training.  McFann  (1969a, 
1969b)  found  that  the  differences  between  high- 
and low-aptitude men in basic combat training
were  greatest  on  cognitive  tasks  and  that  the 
difference  was  not  as  marked  on  motor  skills  and 
proficiency  tests,  with  most  low-aptitude  men 
meeting  standard.  In  the  study,  high-,  middle-,  and 
low-aptitude  groups  were  selected  and  trained, 
using  videotape,  a  one-to-one  student  to  teacher 
ratio, feedback, reinforcement, and small increments.
In some tasks, low-aptitude men reached
standard,  but  took  2  to  4  times  longer,  and  in 
other  cases  they  failed  to  master  the  material  at 
all.  McFann  also  found  that  aptitude  interacts  with 
method  of  instruction.  The  high-aptitude  group 
was  found  to  learn  equally  well  with  lecture  or 
individualized  training,  while  the  low-aptitude 
group  learned  well  with  individualized  training, 
but  not  with  lecture. 

J. Taylor (1970) found that both high- and
low-aptitude  personnel  learn  faster  when  given 
wire  splice  training  via  audiotape  and  slides  as 
compared  with  a  programmed  book.  For  the 
high-aptitude  personnel,  the  programmed  book 
required  25  percent  more  training  time;  for  the 
low-aptitude group, it took 50 percent more training
time. From these results, Taylor suggests that
training  be  adapted  to  individual  differences. 

Language Skills. Vineberg, Sticht, Taylor, and
Caylor  (1970)  found  that  military  training 
manuals  were  6  to  8  grade  levels  above  the  reading 
level  of  Category  IV  personnel,  and  4  to  6  grade 
levels  above  the  reading  level  of  higher  aptitude 
subjects.  Many  of  these  individuals  relied  more 
heavily  on  asking  and  listening  to  others.  In 
another  study,  Sticht  (1969)  found  that  among 
low-aptitude  men  learning  by  listening  was  more 
effective  than  learning  by  reading,  although  some 
did  better  by  reading. 

Summary 

This chapter contained reviews of several comparative
evaluation studies. Some of the studies
were concerned with comparative evaluation of
new training methods while others were concerned
with methods of training low-aptitude personnel.
With regard to the training of low-aptitude men,
more  practical  and  individualized  and  less 
theoretical  training  seems  superior  to  standard 
training  procedures. 


VII.  DISCUSSION 

There  has  been  an  increasing  trend  in  the  past 
decade  in  the  use  of  factor  analysis  and  other 
multivariate  statistical  techniques.  Employment  of 
these  techniques  has  been  made  more  feasible  by 
the increased availability of high-speed computers.
Many  investigators,  though,  tend  to  use  factor 
analysis  as  an  end  product  or  explanation  rather 
than  as  an  aid  in  data  analysis.  Factor  analytic 
research  can  be  misleading  since  the  factors 
derived  from  the  matrix  reduction  are  directly 
dependent upon the variables making up the correlation
matrix. This is a question of content
validity.  If  the  variable  input  is  biased,  then  the 
results  (factors)  will  be  biased.  In  addition,  most 
of  the  recent  factor  analytic  literature  has  been  so 
abstruse  that  it  is  difficult  to  understand  the  ideas 
presented,  much  less  to  implement  them. 

There  has  not  been  enough  attention  to 
canonical correlation, Q-factor analysis, and multivariate
research design. No evaluative studies were
found in which the first two of these methods
were used, and too few studies using the latter
were observed. Perhaps some of these sophisticated
techniques are not appropriate to the data
collected.  In  fact,  a  large  portion  of  the  data 
collected  are  not  worthy  of  any  analysis. 

A  large  portion  of  the  authors  of  the  research 
studies reported in this review are guilty of
violating one or more of the following canons of
statistical methodology: (a) use of too few
subjects; (b) use of inappropriate statistical techniques;
(c) failure to use control groups, or use of
inadequate controls; (d) use of improper sampling
procedures; and (e) use of inappropriate, contaminated,
or unreliable criteria.

Other quantitative methods which are given
much lip service, but which are little used in
practice except by their authors, are (a) sequential
testing, (b) criterion-referenced testing, (c) confidence
testing, (d) part correlation, (e) magnitude
estimation, and (f) application of theory of signal
detection. It behooves other investigators to try
these techniques. Such methods can increase the
sensitivity and generalizability of research findings.

One  method  which  others  are  beginning  to  use 
is Campbell and Fiske’s (1959) technique for
establishing convergent and discriminant validity.
Convergent validity exists if there is a high correlation
between tests purporting to measure the same
thing; and discriminant validity exists when tests
measuring  different  factors  are  independent.  This 
technique  should  prove  very  useful  in  the  future 
for  psychometricians  involved  in  test  construction 
and  validation. 
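
The two definitions above can be illustrated with a small
correlation check. The sketch below is not Campbell and
Fiske's full multitrait-multimethod procedure; it only
compares one same-trait correlation with one different-trait
correlation. All scores, trait labels, and method labels
are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six trainees (invented for illustration only).
trait_a_method_1 = [1, 2, 3, 4, 5, 6]   # trait A, written test
trait_a_method_2 = [2, 1, 4, 3, 6, 5]   # trait A, supervisor rating
trait_b_method_1 = [3, 6, 1, 5, 2, 4]   # trait B, written test

# Convergent evidence: same trait measured by different methods.
convergent_r = pearson_r(trait_a_method_1, trait_a_method_2)
# Discriminant evidence: different traits measured by the same method.
discriminant_r = pearson_r(trait_a_method_1, trait_b_method_1)

print(f"same trait, different methods: r = {convergent_r:.2f}")
print(f"different traits, same method: r = {discriminant_r:.2f}")
```

With these invented data the same-trait correlation is high
and the different-trait correlation is near zero, which is
the pattern the technique looks for.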

Another innovation which will come more into
vogue is cost-effectiveness, or cost-benefit, analysis.
This criterion is useful, as for any other
ratio,  only  if  there  is  an  adequate  data  base  for 
both  the  numerator  and  the  denominator  of  the 
ratio.  Thus,  the  technique  demands  more  precise 
economics  and  performance  evaluative  data. 
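
The dependence of the ratio on both of its terms can be
made explicit in a short sketch. The function name, the
choice of cost per unit of effect as the ratio, and the
dollar and score figures are all assumptions made for
illustration; none of them is prescribed by the studies
reviewed.

```python
from typing import Optional


def cost_effectiveness_ratio(total_cost: Optional[float],
                             effect: Optional[float]) -> Optional[float]:
    """Cost per unit of effect, or None when either term is unusable."""
    if total_cost is None or effect is None:
        return None          # no data base for one side of the ratio
    if effect <= 0:
        return None          # ratio is undefined or not meaningful
    return total_cost / effect


# Hypothetical figures: course cost in dollars, mean gain in test points.
print(cost_effectiveness_ratio(1000.0, 50.0))   # dollars per point gained
print(cost_effectiveness_ratio(1000.0, None))   # effect was never measured
```

Returning no value when a term is missing, rather than a
number, keeps an incomplete data base from producing a
ratio that looks more precise than it is.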

Although  the  moderator  variable  technique  is 
properly  a  subtopic  under  statistical  methods,  its 
emphasis  in  the  recent  literature  demanded  that  it 
be  given  treatment  in  a  separate  chapter  of  this 
review.  A  test  or  measure  can  be  a  moderator 
variable  when  its  use  differentially  determines  the 
predictability  of  another  test  or  measure.  Almost 
any  test  score  may  be  a  potential  moderator 
variable  as  are  race,  sex,  personality,  and  other 
background  factors. 
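
One simple way to look for a moderator effect is to compute
the predictor-criterion correlation separately within each
level of the suspected moderator and compare the results.
The sketch below does this for two subgroups. The scores
and subgroup labels are invented, and a subgroup comparison
of this kind is only one of several ways a moderator could
be examined.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictor and criterion scores, split on a moderator variable.
subgroup_x = {"predictor": [1, 2, 3, 4, 5], "criterion": [2, 3, 5, 5, 7]}
subgroup_y = {"predictor": [1, 2, 3, 4, 5], "criterion": [3, 1, 4, 2, 3]}

validity_x = pearson_r(subgroup_x["predictor"], subgroup_x["criterion"])
validity_y = pearson_r(subgroup_y["predictor"], subgroup_y["criterion"])

print(f"predictor validity in subgroup X: r = {validity_x:.2f}")
print(f"predictor validity in subgroup Y: r = {validity_y:.2f}")
```

In these invented data the predictor works well in one
subgroup and poorly in the other, which is the pattern
that marks the grouping variable as a moderator.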

Cognitive  style  seems  to  differ  across  deprived 
and  non-deprived  groups  and  must  be  accounted 
for  and  taken  into  consideration  in  order  that  the 
potential  of  the  human  resources  in  our  society 
can  be  maximized. 

Several  studies  were  surveyed  which  use  race 
and aptitude as moderator variables. One important
conclusion (Boehm, 1971) to be drawn from
this research is that objective and performance
oriented dependent measures are less likely to
show differences across racial groups than the
more subjective rating methods. Another conclusion
(McFann, 1969a, 1969b) is that high-aptitude
groups learn equally well with lecture or individualized
training, while low-aptitude groups learn
well  with  individualized  training  but  not  with 
lecture. 

Individualized  or  programmed  instruction  is 
another  major  educational  trend  which  has 
achieved prominence in the last five or ten years.
Individualized or programmed instruction represents
an amalgam of the principles of learning
theory with the idiosyncrasies of the individual.
Programmed instruction can be sequential,
allowing the individual to proceed in very small
steps  through  a  fixed  instructional  sequence,  or 
branched.  Branching  allows  the  individual’s 
progress to be governed by his own responses.
Sequential testing has been used in individualized
instruction  in  order  to  ascertain  rapidly  the  level 
of  knowledge  possessed  by  the  student.  Also, 
criterion-referenced  tests,  rather  than  norm- 
referenced  tests,  have  been  employed,  since  the 
student  must  be  able  to  perform  each  unit  of 
instruction  at  a  certain  level  of  proficiency  before 
advancing  to  the  next  unit  of  instruction. 
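
The advancement rule described above can be sketched as a
simple gate. The example covers only a fixed sequence with
a criterion-referenced check at the end of each unit; it
does not model branching. The function name and the 80
percent standard are assumptions chosen for illustration.

```python
def next_unit(current_unit: int, score: float, criterion: float) -> int:
    """Return the unit the student should study next.

    Criterion-referenced rule: the score is compared with a fixed
    proficiency standard, not with the scores of other students.
    """
    if score >= criterion:
        return current_unit + 1   # standard met: advance
    return current_unit           # standard not met: repeat the unit


# Hypothetical run: 80 percent correct is the assumed standard for each unit.
unit = 1
for score in [0.60, 0.85, 0.90, 0.70, 0.80]:
    unit = next_unit(unit, score, criterion=0.80)
print(unit)
```

A norm-referenced rule would instead need the scores of
the whole class before any one student could be advanced.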

Computer assisted instruction (CAI) is the
application of computers to programmed instruction.
CAI can be especially practical when a large
number  of  short  tests  must  be  given  to  the  trainee, 
and  when  instructor-student  interaction  is  not 
considered  crucial  to  learning. 

Another  noted  trend  was  an  increased  concern 
with  cross-cultural  training  and  evaluation.  Here, 
the  “cultural  assimilator”  (Fiedler,  Mitchell,  & 
Triandis, 1970; Worchel & Mitchell, 1970) seemed
to possess some merit. In this method, critical
incidents are obtained regarding circumstances in
which the norms of behaviors across cultures are
quite different. Questions are asked about the
incident, and the multiple-choice answer format is
employed. The responses of a target sample from
the host culture are employed to provide the
correct  answer  keying. 
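
The keying step can be sketched as follows. The review
states only that host-culture responses supply the key;
taking the most frequently chosen option as the keyed
answer is an assumption made here for illustration, as are
the function name and the sample responses.

```python
from collections import Counter


def key_item(host_responses):
    """Return the option chosen most often by the host-culture sample."""
    counts = Counter(host_responses)
    option, _ = counts.most_common(1)[0]
    return option


# Hypothetical responses from five host-culture members to one incident.
print(key_item(["b", "b", "a", "c", "b"]))
```

A practical version would also need a rule for ties and
for items on which the host sample shows no clear
agreement.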

Similarly,  emphasis  on  increasing  basic  skills 
generally  and  reading  skill  specifically  has  achieved 
import.  Courses  in  reading  instruction  have 
produced  gains  in  reading  speed,  retention  of 
reading  speed,  and  transfer.  No  single  method  of 
reading instruction seems to have demonstrated
superiority  to  another. 

A method developed by Greer, Smith, and
Hatfield (1967) has to some degree eliminated
rater bias in helicopter checkpilots. After a task
analysis, proficiency tests and instrument observation
were substituted for the checkpilot’s own
evaluation method. This technique was able to (a)
increase the reliability of evaluation, (b) identify
specific student deficiencies, and (c) increase
checkpilot consistency.

Siegel and Schultz (1961) and Siegel, Schultz,
and Federman (1961) constructed an evaluative
technique using matrix concepts which was
successfully applied to a military setting (Schultz
& Siegel, 1962). These writers feel that training is
good  if  the  average  trainee  performs  proficiently 
on  important  tasks.  Training  is  poor  if  the  average 
worker  performs  poorly  on  important  tasks.  This 
method  identifies  deficiencies  in  the  training 
program  which  need  emphasis  and  those  parts  of 
the  training  program  which  need  deemphasis. 
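
The logic of crossing task importance with trainee
proficiency can be sketched as a two-by-two classification.
This is not the authors' matrix procedure, whose details
are not given here. The cutoff, the 0 to 1 scales, the task
names, and the rule that an unimportant task performed well
is a candidate for deemphasis are all assumptions made for
illustration.

```python
def classify_task(importance: float, proficiency: float,
                  cutoff: float = 0.5) -> str:
    """Place one task in a 2 x 2 importance-by-proficiency table."""
    important = importance >= cutoff
    proficient = proficiency >= cutoff
    if important and not proficient:
        return "emphasize"      # important task performed poorly
    if not important and proficient:
        return "deemphasize"    # assumption: training exceeds the job's need
    return "maintain"


# Hypothetical tasks: (importance, average trainee proficiency), both 0 to 1.
tasks = {"troubleshooting": (0.9, 0.3),
         "record keeping": (0.2, 0.8),
         "equipment operation": (0.8, 0.9)}
for name, (importance, proficiency) in tasks.items():
    print(name, classify_task(importance, proficiency))
```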


The  comparative  studies  discussed  in  this  review 
were  concerned  with  relative  comparisons  between 
two  or  more  methods  of  instruction  or  training.  In 
most  cases  a  new  training  method  was  compared 
with  a  standard  method  to  determine  if  the  latter 
should  be  modified  or  replaced.  Some  of  the 
conclusions  to  be  drawn  from  this  research  are 
presented. 

1.  CAI  is  superior  to  standard  instruction  for 
electronics technicians in terms of achievement
and speed (Hurlock, 1971).

2.  If personnel shortages exist, job experienced
Air Force instructors may be used in
practical shop related courses, and
instructors who are not job experienced
may be used in lecture courses (Askren &
Valentine,  1970). 

3.  Some of the newer Army marksmanship
training methods are superior to the older,
standard methods (McFann, Buchanan,
Lyons, Ward, & Waits, 1958; Olmstead,
1968).

4.  The benefits of simulator training are variable
and seem to be dependent on a multiplicity
of factors.

5.  CAI, overall, seems to be a cost-
effective  training  technique. 

6.  Students indicate a preference for
traditional lectures over TV instruction
(Fishman, Keller, & Atkinson, 1968).

7.  Variations  in  the  fidelity  of  a  trainer  seem 
to  produce  no  observable  performance 
differences. 

8.  Accelerated training is successful for high-
aptitude students in avionics fundamentals
training (Johnson & Salop, 1968).

9.  NCO leadership training resulted in improved
leader behavior over a no-training
group (Hood, Showel, & Stewart, 1967).

10.  Fuel  conservation  training  can  reduce  fuel 
consumption  in  drivers  of  the  M48  tank 
(Olson & Baerman, 1955).

11.  A programmed book reading instruction
course produces greater long-term improvement
than machine training (Jones &
Carran, 1965).

There  has  also  been  considerable  recent  concern 
with  low-aptitude  individuals  who,  generally,  can 
perform many skilled tasks adequately when given
proper training. They tend to be slower learners
and  retain  knowledge  best  when  taught  by 
practical  rather  than  highly  verbal  means. 

Finally,  systematic  approaches  to  evaluation  and 
course  development  are  beginning  to  receive  some 
emphasis.  These  attempt  to  account  for  almost  all 
of  the  variables  that  can  affect  training  and 
student  behavior.  Most  systems  begin  with  a  job 
analysis  in  order  to  derive  a  list  of  behaviorally 
oriented job requirements from which training
objectives can be formulated. Many writers
advocate  a  pre-training  appraisal  of  the  entering 
students  in  order  to  direct  them  to  the  training 
method  which  is  most  suited  to  their  needs  and 
abilities.  Criterion-referenced  tests  and  other 
measures  of  student  behavior  are  then  constructed 
in  order  to  reflect  the  training  objectives.  Finally, 
after  training,  the  students  and  the  training 
program are evaluated through various means.